2025-05-07T20:23:26.0473677Z Current runner version: '2.323.0'
2025-05-07T20:23:26.0479361Z Runner name: 'i-050728826a2d12e7e'
2025-05-07T20:23:26.0480277Z Machine name: 'ip-10-0-27-143'
2025-05-07T20:23:26.0482959Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.0485209Z Contents: read
2025-05-07T20:23:26.0485727Z Metadata: read
2025-05-07T20:23:26.0486212Z Packages: read
2025-05-07T20:23:26.0486696Z ##[endgroup]
2025-05-07T20:23:26.0488531Z Secret source: None
2025-05-07T20:23:26.0489151Z Prepare workflow directory
2025-05-07T20:23:26.1403938Z Prepare all required actions
2025-05-07T20:23:26.1445625Z Getting action download info
2025-05-07T20:23:26.3842065Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.6621605Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:27.0102283Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.6667594Z Getting action download info
2025-05-07T20:23:28.7699875Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:28.9864348Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.6.3, 12.6.3, gcc)
2025-05-07T20:23:29.0459097Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:29.0593199Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:29.0605717Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:29.0607207Z ##[endgroup]
2025-05-07T20:23:30.0173772Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.0174175Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.0174421Z AMI Name: unknown
2025-05-07T20:23:30.0211216Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.3980825Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.3981139Z with:
2025-05-07T20:23:35.3981391Z   submodules: true
2025-05-07T20:23:35.3981628Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.3982026Z   token: ***
2025-05-07T20:23:35.3982228Z   ssh-strict: true
2025-05-07T20:23:35.3982445Z   ssh-user: git
2025-05-07T20:23:35.3982671Z   persist-credentials: true
2025-05-07T20:23:35.3982928Z   clean: true
2025-05-07T20:23:35.3983164Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.3983430Z   fetch-depth: 1
2025-05-07T20:23:35.3983647Z   fetch-tags: false
2025-05-07T20:23:35.3983864Z   show-progress: true
2025-05-07T20:23:35.3984091Z   lfs: false
2025-05-07T20:23:35.3984300Z   set-safe-directory: true
2025-05-07T20:23:35.3984563Z env:
2025-05-07T20:23:35.3984779Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.3985091Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.3985356Z   BUILD_TARGET: genai
2025-05-07T20:23:35.3985584Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.3985850Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:35.3986099Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.3986337Z ##[endgroup]
2025-05-07T20:23:35.5144827Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:35.5146010Z ##[group]Getting Git version info
2025-05-07T20:23:35.5146454Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:35.5147064Z [command]/usr/bin/git version
2025-05-07T20:23:35.5147325Z git version 2.47.1
2025-05-07T20:23:35.5165640Z ##[endgroup]
2025-05-07T20:23:35.5179163Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/3562e8cb-07d7-40de-aedb-7c23eadca378' before making global git config changes
2025-05-07T20:23:35.5180065Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:35.5192952Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:35.5233747Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:35.5257178Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:35.5275909Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:35.5281541Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:35.5306758Z refs/heads/main
2025-05-07T20:23:35.5315762Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.3968003Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.4019283Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.4045534Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.4051837Z ##[endgroup]
2025-05-07T20:23:36.4054648Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.4475230Z  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:36.4562697Z  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:36.4650213Z  6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:36.4740711Z  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:36.4826550Z  f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:36.4911784Z  420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:36.4997148Z  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:36.5010571Z ##[group]Cleaning the repository
2025-05-07T20:23:36.5015799Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:36.5074428Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:36.5181013Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.5188029Z ##[endgroup]
2025-05-07T20:23:36.5190110Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:36.5194792Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:36.5226330Z ##[endgroup]
2025-05-07T20:23:36.5226726Z ##[group]Setting up auth
2025-05-07T20:23:36.5232416Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:36.5275512Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:36.5607798Z Entering 'external/asmjit'
2025-05-07T20:23:36.5674802Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.5746658Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.5813926Z Entering 'external/cutlass'
2025-05-07T20:23:36.5888311Z Entering 'external/googletest'
2025-05-07T20:23:36.5952453Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.6017237Z Entering 'external/json'
2025-05-07T20:23:36.6103114Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:36.6135732Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:36.6467634Z Entering 'external/asmjit'
2025-05-07T20:23:36.6531995Z Entering 'external/composable_kernel'
2025-05-07T20:23:36.6604255Z Entering 'external/cpuinfo'
2025-05-07T20:23:36.6668576Z Entering 'external/cutlass'
2025-05-07T20:23:36.6743696Z Entering 'external/googletest'
2025-05-07T20:23:36.6812035Z Entering 'external/hipify_torch'
2025-05-07T20:23:36.6875658Z Entering 'external/json'
2025-05-07T20:23:36.6963727Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:36.7014623Z ##[endgroup]
2025-05-07T20:23:36.7015052Z ##[group]Fetching the repository
2025-05-07T20:23:36.7021960Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:36.9346192Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:36.9346693Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:36.9372855Z ##[endgroup]
2025-05-07T20:23:36.9373349Z ##[group]Determining the checkout info
2025-05-07T20:23:36.9374518Z ##[endgroup]
2025-05-07T20:23:36.9379208Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:36.9431507Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:36.9460139Z ##[group]Checking out the ref
2025-05-07T20:23:36.9464299Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:36.9585733Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.9589074Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:36.9598413Z ##[endgroup]
2025-05-07T20:23:36.9598819Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:36.9604124Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:36.9651775Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:36.9682474Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:36.9713482Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:36.9741845Z ##[endgroup]
2025-05-07T20:23:36.9742215Z ##[group]Fetching submodules
2025-05-07T20:23:36.9744981Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.0119346Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.0119815Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.0120243Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.0120616Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.0121303Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.0121712Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.0122112Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.0136214Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.0563707Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.0714627Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.0817950Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.0987344Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.1077753Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.1158493Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.1259796Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.1277242Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.1610184Z Entering 'external/asmjit'
2025-05-07T20:23:37.1642109Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.1674353Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.1706473Z Entering 'external/cutlass'
2025-05-07T20:23:37.1738616Z Entering 'external/googletest'
2025-05-07T20:23:37.1772060Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.1804193Z Entering 'external/json'
2025-05-07T20:23:37.1850415Z ##[endgroup]
2025-05-07T20:23:37.1850818Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.1857394Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.2188087Z Entering 'external/asmjit'
2025-05-07T20:23:37.2231187Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2231867Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2275257Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.2317789Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2318249Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2368649Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.2410082Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2410513Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2457314Z Entering 'external/cutlass'
2025-05-07T20:23:37.2502545Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2502990Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2554641Z Entering 'external/googletest'
2025-05-07T20:23:37.2599392Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2599829Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2643652Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.2688611Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2689057Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2730874Z Entering 'external/json'
2025-05-07T20:23:37.2777939Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2778387Z url.https://github.com/.insteadof
2025-05-07T20:23:37.2838096Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.3171432Z Entering 'external/asmjit'
2025-05-07T20:23:37.3232707Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.3235233Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.3296643Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.3299516Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.3365377Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.3366496Z Entering 'external/cutlass'
2025-05-07T20:23:37.3424169Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.3426834Z Entering 'external/googletest'
2025-05-07T20:23:37.3487691Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.3491443Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.3550257Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.3552976Z Entering 'external/json'
2025-05-07T20:23:37.3614946Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.3738953Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.4071367Z Entering 'external/asmjit'
2025-05-07T20:23:37.4103394Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.4135821Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.4170342Z Entering 'external/cutlass'
2025-05-07T20:23:37.4201676Z Entering 'external/googletest'
2025-05-07T20:23:37.4234650Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.4268866Z Entering 'external/json'
2025-05-07T20:23:37.4317023Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.4647914Z Entering 'external/asmjit'
2025-05-07T20:23:37.4681838Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.4714549Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.4746342Z Entering 'external/cutlass'
2025-05-07T20:23:37.4778727Z Entering 'external/googletest'
2025-05-07T20:23:37.4810902Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.4843179Z Entering 'external/json'
2025-05-07T20:23:37.4888829Z ##[endgroup]
2025-05-07T20:23:37.4929165Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:37.4955711Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:37.5143510Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:37.5143835Z with:
2025-05-07T20:23:37.5144072Z   name: fbgemm_genai_x86_gcc_py3.12_cu12.6.3.whl
2025-05-07T20:23:37.5144382Z   merge-multiple: false
2025-05-07T20:23:37.5144639Z   repository: pytorch/FBGEMM
2025-05-07T20:23:37.5144890Z   run-id: 14891846252
2025-05-07T20:23:37.5145106Z env:
2025-05-07T20:23:37.5145326Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:37.5145616Z   BUILD_ENV: build_binary
2025-05-07T20:23:37.5145862Z   BUILD_TARGET: genai
2025-05-07T20:23:37.5146078Z   BUILD_VARIANT: cuda
2025-05-07T20:23:37.5146309Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:37.5146558Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:37.5146792Z ##[endgroup]
2025-05-07T20:23:37.7509955Z Downloading single artifact
2025-05-07T20:23:37.9190612Z Preparing to download the following artifacts:
2025-05-07T20:23:37.9191431Z - fbgemm_genai_x86_gcc_py3.12_cu12.6.3.whl (ID: 3081362852, Size: 12511372, Expected Digest: sha256:fda2094d8736a8502a6727b9a5f7a5a78f8048753893d498f4d03c0c6fa9ef69)
2025-05-07T20:23:37.9729964Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-ec3a2fd8-75ec-5d2c-a37b-ee6ee19c88ae/artifacts/768a04041691747daab1e752da2c135b903b31da5ee0699a6f825976517e0bc8.zip
2025-05-07T20:23:37.9731358Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.0567044Z (node:68266) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.0567968Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.2734282Z SHA256 digest of downloaded artifact is fda2094d8736a8502a6727b9a5f7a5a78f8048753893d498f4d03c0c6fa9ef69
2025-05-07T20:23:38.2734947Z Artifact download completed successfully.
2025-05-07T20:23:38.2735329Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.2741342Z Download artifact has finished successfully
2025-05-07T20:23:38.2984994Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.2985383Z with:
2025-05-07T20:23:38.2985597Z   driver-version: 570.133.07
2025-05-07T20:23:38.2985836Z env:
2025-05-07T20:23:38.2986057Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.2986360Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.2986598Z   BUILD_TARGET: genai
2025-05-07T20:23:38.2986830Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.2987067Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.2987325Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.2987552Z ##[endgroup]
2025-05-07T20:23:38.3084044Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.3084426Z with:
2025-05-07T20:23:38.3084630Z   timeout_minutes: 10
2025-05-07T20:23:38.3084866Z   max_attempts: 3
2025-05-07T20:23:38.3107789Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU
          # if there is more than one of them, so that the same driver version is not
          # printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |  4184MiB / 23028MiB  |     ERR!     Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info like
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.3130826Z   retry_wait_seconds: 10
2025-05-07T20:23:38.3131088Z   polling_interval_seconds: 1
2025-05-07T20:23:38.3131348Z   warning_on_retry: true
2025-05-07T20:23:38.3131595Z   continue_on_error: false
2025-05-07T20:23:38.3132070Z env:
2025-05-07T20:23:38.3132347Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.3132757Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.3133192Z   BUILD_TARGET: genai
2025-05-07T20:23:38.3133502Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.3150434Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.3150708Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.3150950Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.3151189Z ##[endgroup]
2025-05-07T20:23:38.3850324Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.3851809Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.3853859Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:38.7325488Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:38.7326209Z No packages marked for removal.
2025-05-07T20:23:38.7388758Z Dependencies resolved.
2025-05-07T20:23:38.7398427Z Nothing to do.
2025-05-07T20:23:38.7398914Z Complete!
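One detail worth calling out from the download-artifact step above: the artifact is verified against an expected SHA-256 digest before use. A minimal standalone sketch of the same check, assuming the artifact archive has been saved locally (the file name and variable names here are illustrative, not part of the workflow):

  # EXPECTED is the "Expected Digest" value printed by download-artifact.
  EXPECTED="fda2094d8736a8502a6727b9a5f7a5a78f8048753893d498f4d03c0c6fa9ef69"
  ACTUAL="$(sha256sum artifact.zip | awk '{print $1}')"   # hypothetical local file
  if [ "$ACTUAL" = "$EXPECTED" ]; then
      echo "Artifact digest verified"
  else
      echo "Digest mismatch: got $ACTUAL, expected $EXPECTED" >&2
      exit 1
  fi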
2025-05-07T20:23:38.7745432Z + install_nvidia_driver_common
2025-05-07T20:23:38.7749454Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:38.7749748Z + lspci
2025-05-07T20:23:38.7751508Z Before installing NVIDIA driver
2025-05-07T20:23:38.7875067Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:38.7876001Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:38.7876543Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:38.7877048Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:38.7877515Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:38.7878027Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:38.7878489Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:38.7878958Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
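The A10G is visible at PCI address 00:1e.0 above. The script's GPU-reset path enumerates NVIDIA devices the same way; distilled from the script:

  # List NVIDIA PCI devices; -D prints the domain (0000:00:1e.0 ...),
  # which is the form needed for /sys/bus/pci/devices/<id> in the reset loop.
  lspci -D | grep -i NVIDIA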
2025-05-07T20:23:38.7879397Z + lsmod
2025-05-07T20:23:38.7924245Z Module                         Size  Used by
2025-05-07T20:23:38.7924558Z xt_nat                        16384  0
2025-05-07T20:23:38.7924832Z nvidia_modeset              1716224  0
2025-05-07T20:23:38.7925189Z video                         65536  1 nvidia_modeset
2025-05-07T20:23:38.7925556Z wmi                           36864  1 video
2025-05-07T20:23:38.7925830Z nvidia_uvm                  1884160  0
2025-05-07T20:23:38.7926130Z nvidia                     11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:38.7926451Z drm                          602112  1 nvidia
2025-05-07T20:23:38.7926758Z drm_panel_orientation_quirks  32768  1 drm
2025-05-07T20:23:38.7927119Z backlight                     24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:38.7927458Z i2c_core                     110592  2 nvidia,drm
2025-05-07T20:23:38.7927742Z veth                          36864  0
2025-05-07T20:23:38.7927997Z xt_conntrack                  16384  1
2025-05-07T20:23:38.7928249Z nft_chain_nat                 16384  3
2025-05-07T20:23:38.7928533Z xt_MASQUERADE                 20480  1
2025-05-07T20:23:38.7928845Z nf_nat                        57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:38.7929186Z nf_conntrack_netlink          57344  0
2025-05-07T20:23:38.7930053Z nf_conntrack                 184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:38.7930506Z nf_defrag_ipv6                24576  1 nf_conntrack
2025-05-07T20:23:38.7930819Z nf_defrag_ipv4                16384  1 nf_conntrack
2025-05-07T20:23:38.7931120Z xfrm_user                     57344  1
2025-05-07T20:23:38.7931378Z xfrm_algo                     16384  1 xfrm_user
2025-05-07T20:23:38.7931665Z xt_addrtype                   16384  2
2025-05-07T20:23:38.7931920Z nft_compat                    20480  4
2025-05-07T20:23:38.7932218Z nf_tables                    311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:38.7932619Z nfnetlink                     20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:38.7932984Z br_netfilter                  36864  0
2025-05-07T20:23:38.7933340Z bridge                       323584  1 br_netfilter
2025-05-07T20:23:38.7933630Z stp                           16384  1 bridge
2025-05-07T20:23:38.7933912Z llc                           16384  2 bridge,stp
2025-05-07T20:23:38.7934194Z overlay                      167936  0
2025-05-07T20:23:38.7934444Z tls                          135168  0
2025-05-07T20:23:38.7934695Z nls_ascii                     16384  1
2025-05-07T20:23:38.7934945Z nls_cp437                     20480  1
2025-05-07T20:23:38.7935186Z vfat                          24576  1
2025-05-07T20:23:38.7935433Z fat                           86016  1 vfat
2025-05-07T20:23:38.7935699Z sunrpc                       696320  1
2025-05-07T20:23:38.7935936Z i8042                         45056  0
2025-05-07T20:23:38.7936176Z ena                          180224  0
2025-05-07T20:23:38.7936426Z serio                         28672  3 i8042
2025-05-07T20:23:38.7936695Z ghash_clmulni_intel           16384  0
2025-05-07T20:23:38.7936955Z button                        24576  0
2025-05-07T20:23:38.7937207Z sch_fq_codel                  20480  17
2025-05-07T20:23:38.7937454Z dm_mod                       188416  0
2025-05-07T20:23:38.7937705Z fuse                         163840  1
2025-05-07T20:23:38.7937949Z loop                          36864  0
2025-05-07T20:23:38.7938191Z configfs                      57344  1
2025-05-07T20:23:38.7938448Z dax                           45056  1 dm_mod
2025-05-07T20:23:38.7938725Z dmi_sysfs                     20480  0
2025-05-07T20:23:38.7939124Z crc32_pclmul                  16384  0
2025-05-07T20:23:38.7939378Z crc32c_intel                  24576  0
2025-05-07T20:23:38.7939630Z efivarfs                      24576  1
2025-05-07T20:23:38.7939915Z + modinfo nvidia
2025-05-07T20:23:38.7942891Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:38.7943499Z import_ns:      DMA_BUF
2025-05-07T20:23:38.7943759Z alias:          char-major-195-*
2025-05-07T20:23:38.7944029Z version:        570.133.07
2025-05-07T20:23:38.7944278Z supported:      external
2025-05-07T20:23:38.7944522Z license:        Dual MIT/GPL
2025-05-07T20:23:38.7944815Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:38.7945215Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:38.7945586Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:38.7945909Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:38.7946261Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:38.7946599Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:38.7946918Z depends:        i2c-core,drm
2025-05-07T20:23:38.7947202Z retpoline:      Y
2025-05-07T20:23:38.7947424Z name:           nvidia
2025-05-07T20:23:38.7947787Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:38.7948258Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:38.7948802Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:38.7949267Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:38.7949575Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:38.7949879Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:38.7950187Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:38.7950491Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:38.7950925Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:38.7951291Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:38.7951778Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:38.7952116Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:38.7952426Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:38.7952728Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:38.7953093Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:38.7953490Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:38.7953866Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:38.7954280Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:38.7954689Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:38.7955186Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:38.7955601Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:38.7955944Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:38.7956312Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:38.7956677Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:38.7957009Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:38.7957331Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:38.7957654Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:38.7957977Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:38.7958288Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:38.7958629Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:38.7958990Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:38.7959555Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:38.7959882Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:38.7960221Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:38.7960558Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:38.7960908Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:38.7961391Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:38.7961680Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:38.7962001Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:38.7962319Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:38.7962633Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:38.7962961Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:38.7963305Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:38.7963651Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:38.7963977Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:38.7964327Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:38.7964658Z parm:           rm_firmware_active:charp
2025-05-07T20:23:38.7964953Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:38.7965195Z ++ command -v nvidia-smi
2025-05-07T20:23:38.7965452Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:38.7965706Z + set +e
2025-05-07T20:23:38.7966021Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:38.8184049Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:38.8184443Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:38.8184765Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:38.8185188Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:38.8185510Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:38.8185939Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:38.8186403Z + set -e
2025-05-07T20:23:38.8186603Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:38.8187004Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
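The decision just logged — skip installation because the installed driver matches — distills to a few lines; DRIVER_VERSION is the action input (570.133.07 here):

  # Skip reinstallation when the requested driver is already active.
  if command -v nvidia-smi >/dev/null 2>&1; then
      INSTALLED="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)"
      if [ "$INSTALLED" = "$DRIVER_VERSION" ]; then
          echo "NVIDIA driver ($INSTALLED) already installed; skipping installation"
      fi
  fi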
2025-05-07T20:23:38.8187461Z + post_install_nvidia_driver_common
2025-05-07T20:23:38.8190890Z + sudo modprobe nvidia
2025-05-07T20:23:39.0061911Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:39.0062756Z + lspci
2025-05-07T20:23:39.0063008Z After installing NVIDIA driver
2025-05-07T20:23:39.0180567Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.0181059Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.0181599Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.0182099Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.0182570Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.0183210Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.0183937Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.0184403Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.0184798Z + lsmod
2025-05-07T20:23:39.0218563Z Module                         Size  Used by
2025-05-07T20:23:39.0218898Z xt_nat                        16384  0
2025-05-07T20:23:39.0219196Z nvidia_modeset              1716224  0
2025-05-07T20:23:39.0219597Z video                         65536  1 nvidia_modeset
2025-05-07T20:23:39.0219948Z wmi                           36864  1 video
2025-05-07T20:23:39.0220217Z nvidia_uvm                  1884160  0
2025-05-07T20:23:39.0220526Z nvidia                     11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:39.0220853Z drm                          602112  1 nvidia
2025-05-07T20:23:39.0221152Z drm_panel_orientation_quirks  32768  1 drm
2025-05-07T20:23:39.0221540Z backlight                     24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:39.0221880Z i2c_core                     110592  2 nvidia,drm
2025-05-07T20:23:39.0222163Z veth                          36864  0
2025-05-07T20:23:39.0222409Z xt_conntrack                  16384  1
2025-05-07T20:23:39.0222665Z nft_chain_nat                 16384  3
2025-05-07T20:23:39.0222918Z xt_MASQUERADE                 20480  1
2025-05-07T20:23:39.0223220Z nf_nat                        57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.0223566Z nf_conntrack_netlink          57344  0
2025-05-07T20:23:39.0224217Z nf_conntrack                 184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.0224665Z nf_defrag_ipv6                24576  1 nf_conntrack
2025-05-07T20:23:39.0224973Z nf_defrag_ipv4                16384  1 nf_conntrack
2025-05-07T20:23:39.0225263Z xfrm_user                     57344  1
2025-05-07T20:23:39.0225524Z xfrm_algo                     16384  1 xfrm_user
2025-05-07T20:23:39.0225811Z xt_addrtype                   16384  2
2025-05-07T20:23:39.0226066Z nft_compat                    20480  4
2025-05-07T20:23:39.0226365Z nf_tables                    311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.0226761Z nfnetlink                     20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.0227126Z br_netfilter                  36864  0
2025-05-07T20:23:39.0227401Z bridge                       323584  1 br_netfilter
2025-05-07T20:23:39.0227681Z stp                           16384  1 bridge
2025-05-07T20:23:39.0227961Z llc                           16384  2 bridge,stp
2025-05-07T20:23:39.0228241Z overlay                      167936  0
2025-05-07T20:23:39.0228487Z tls                          135168  0
2025-05-07T20:23:39.0228784Z nls_ascii                     16384  1
2025-05-07T20:23:39.0229034Z nls_cp437                     20480  1
2025-05-07T20:23:39.0229273Z vfat                          24576  1
2025-05-07T20:23:39.0229521Z fat                           86016  1 vfat
2025-05-07T20:23:39.0229782Z sunrpc                       696320  1
2025-05-07T20:23:39.0230023Z i8042                         45056  0
2025-05-07T20:23:39.0230257Z ena                          180224  0
2025-05-07T20:23:39.0230506Z serio                         28672  3 i8042
2025-05-07T20:23:39.0230783Z ghash_clmulni_intel           16384  0
2025-05-07T20:23:39.0231033Z button                        24576  0
2025-05-07T20:23:39.0231288Z sch_fq_codel                  20480  17
2025-05-07T20:23:39.0231545Z dm_mod                       188416  0
2025-05-07T20:23:39.0231785Z fuse                         163840  1
2025-05-07T20:23:39.0232031Z loop                          36864  0
2025-05-07T20:23:39.0232435Z configfs                      57344  1
2025-05-07T20:23:39.0232684Z dax                           45056  1 dm_mod
2025-05-07T20:23:39.0232964Z dmi_sysfs                     20480  0
2025-05-07T20:23:39.0233218Z crc32_pclmul                  16384  0
2025-05-07T20:23:39.0233463Z crc32c_intel                  24576  0
2025-05-07T20:23:39.0233711Z efivarfs                      24576  1
2025-05-07T20:23:39.0233958Z + modinfo nvidia
2025-05-07T20:23:39.0236836Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.0237430Z import_ns:      DMA_BUF
2025-05-07T20:23:39.0237704Z alias:          char-major-195-*
2025-05-07T20:23:39.0237970Z version:        570.133.07
2025-05-07T20:23:39.0238205Z supported:      external
2025-05-07T20:23:39.0238455Z license:        Dual MIT/GPL
2025-05-07T20:23:39.0238736Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.0239068Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.0239373Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:39.0239691Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.0240023Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.0240346Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.0240651Z depends:        i2c-core,drm
2025-05-07T20:23:39.0240906Z retpoline:      Y
2025-05-07T20:23:39.0241113Z name:           nvidia
2025-05-07T20:23:39.0241462Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.0241920Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.0242353Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.0242750Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.0243053Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:39.0243347Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.0243649Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:39.0243945Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:39.0244247Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:39.0244714Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.0245099Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.0245420Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.0245714Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:39.0246011Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.0246366Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.0246757Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.0247124Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.0247528Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0247926Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.0248329Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.0248741Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.0249080Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.0249433Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.0249794Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.0250123Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.0250484Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.0250803Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.0251122Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.0251425Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:39.0251758Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.0252116Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.0252437Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:39.0252757Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.0253098Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.0253622Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:39.0253963Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.0254280Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:39.0254563Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.0254876Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.0255185Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.0255496Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.0255817Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.0256191Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.0256528Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:39.0256843Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.0257170Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.0257498Z parm:           rm_firmware_active:charp
2025-05-07T20:23:39.0257779Z + set +e
2025-05-07T20:23:39.0257978Z + nvidia-smi
2025-05-07T20:23:39.0417266Z Wed May  7 20:23:39 2025
2025-05-07T20:23:39.0417647Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:39.0418303Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:39.0418859Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:39.0419340Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:39.0419855Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:39.0420281Z |                                         |                        |               MIG M. |
2025-05-07T20:23:39.0420607Z |=========================================+========================+======================|
2025-05-07T20:23:39.0555349Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:39.0556076Z |  0%   25C    P8              9W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:39.0556522Z |                                         |                        |                  N/A |
2025-05-07T20:23:39.0556911Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:39.0560390Z
2025-05-07T20:23:39.0560877Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:39.0561386Z | Processes:                                                                              |
2025-05-07T20:23:39.0561816Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:39.0562242Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:39.0562601Z |=========================================================================================|
2025-05-07T20:23:39.0566837Z |  No running processes found                                                             |
2025-05-07T20:23:39.0567441Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:39.3226346Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:39.3388909Z NVIDIA A10G
2025-05-07T20:23:39.3430952Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:39.3432550Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:39.3432885Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:39.3433287Z + set -e
2025-05-07T20:23:39.3433571Z INFO: Ignoring allowed status 0
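The job uses nvidia-smi's CSV query interface twice above (driver_version, then gpu_name, the field that goes missing when the driver has crashed). The same interface generalizes to other fields; a sketch, with standard query property names and output values taken from this runner:

  # Machine-readable GPU facts; add or remove fields as needed.
  nvidia-smi --query-gpu=name,driver_version,memory.total,temperature.gpu \
             --format=csv,noheader --id=0
  # => NVIDIA A10G, 570.133.07, 23028 MiB, 25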
2025-05-07T20:23:39.3440758Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:39.3444164Z + sudo yum install -y yum-utils
2025-05-07T20:23:39.8106041Z Last metadata expiration check: 0:09:36 ago on Wed May  7 20:14:03 2025.
2025-05-07T20:23:39.8352298Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:39.8758444Z Dependencies resolved.
2025-05-07T20:23:39.8931931Z Nothing to do.
2025-05-07T20:23:39.8932273Z Complete!
2025-05-07T20:23:39.9322933Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:39.9323550Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:39.9324392Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:40.2654948Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:40.3239177Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:40.8277145Z nvidia-container-toolkit                         14 kB/s | 833 B     00:00
2025-05-07T20:23:40.8523894Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:40.8529598Z Package nvidia-container-toolkit-1.16.2-1.x86_64 is already installed.
2025-05-07T20:23:40.8919685Z Dependencies resolved.
2025-05-07T20:23:40.9100779Z Nothing to do.
2025-05-07T20:23:40.9101591Z Complete!
2025-05-07T20:23:40.9485332Z + sudo systemctl restart docker
2025-05-07T20:23:43.4732466Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:43.4931186Z Wed May  7 20:23:43 2025
2025-05-07T20:23:43.4931678Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.4932189Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:43.4932677Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:43.4933260Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:43.4933793Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:43.4934248Z |                                         |                        |               MIG M. |
2025-05-07T20:23:43.4934964Z |=========================================+========================+======================|
2025-05-07T20:23:43.5062885Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:43.5063455Z |  0%   26C    P8              9W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:43.5063835Z |                                         |                        |                  N/A |
2025-05-07T20:23:43.5064229Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:43.5067881Z
2025-05-07T20:23:43.5068394Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.5068971Z | Processes:                                                                              |
2025-05-07T20:23:43.5069461Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:43.5069859Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:43.5070207Z |=========================================================================================|
2025-05-07T20:23:43.5073173Z |  No running processes found                                                             |
2025-05-07T20:23:43.5073815Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:44.3666886Z Command completed after 1 attempt(s).
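The step that just completed exported GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all for later steps. A sketch of how a container step can consume it — the image tag is illustrative, and GPU_FLAG is deliberately left unquoted so it splits into separate docker arguments:

  # Smoke-test the container toolkit with the exported flags.
  docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi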
2025-05-07T20:23:44.3751344Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:44.3751804Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:44.3766017Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:44.3766362Z env:
2025-05-07T20:23:44.3766769Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:44.3767060Z   BUILD_ENV: build_binary
2025-05-07T20:23:44.3767318Z   BUILD_TARGET: genai
2025-05-07T20:23:44.3767554Z   BUILD_VARIANT: cuda
2025-05-07T20:23:44.3767785Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:44.3768046Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:44.3768351Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:44.3768678Z ##[endgroup]
2025-05-07T20:23:44.7163384Z ################################################################################
2025-05-07T20:23:44.7163761Z # Print System Info
2025-05-07T20:23:44.7163978Z #
2025-05-07T20:23:44.7180192Z # [2025-05-07T20:23:44.717Z] + print_system_info
2025-05-07T20:23:44.7180616Z ################################################################################
2025-05-07T20:23:44.7180889Z
2025-05-07T20:23:44.7181014Z ################################################################################
2025-05-07T20:23:44.7181356Z [INFO] Printing environment variables ...
2025-05-07T20:23:44.7181675Z + printenv
2025-05-07T20:23:44.7181789Z
2025-05-07T20:23:44.7218109Z SHELL=/bin/bash
2025-05-07T20:23:44.7218719Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:44.7219258Z BUILD_VARIANT=cuda
2025-05-07T20:23:44.7219965Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7220682Z GITHUB_ACTION=__run
2025-05-07T20:23:44.7220960Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:44.7221300Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:44.7221549Z RUNNER_NAME=i-050728826a2d12e7e
2025-05-07T20:23:44.7221870Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:44.7222173Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:44.7222438Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:44.7222813Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:44.7223225Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:44.7223504Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:44.7223802Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:44.7224258Z ***
2025-05-07T20:23:44.7224482Z LOGNAME=ec2-user
2025-05-07T20:23:44.7224719Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:44.7224973Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:44.7225194Z GITHUB_ACTIONS=true
2025-05-07T20:23:44.7225415Z SYSTEMD_EXEC_PID=55527
2025-05-07T20:23:44.7225690Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:44.7226222Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:44.7226724Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:44.7226998Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:44.7227248Z RUNNER_OS=Linux
2025-05-07T20:23:44.7227469Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:44.7227710Z HOME=/home/ec2-user
2025-05-07T20:23:44.7227962Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:44.7228240Z LANG=C.UTF-8
2025-05-07T20:23:44.7228532Z RUNNER_TRACKING_ID=github_1c494cfc-f0a5-47f9-8949-f21ca7f48e65
2025-05-07T20:23:44.7228887Z RUNNER_ARCH=X64
2025-05-07T20:23:44.7229151Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:44.7229470Z BUILD_TARGET=genai
2025-05-07T20:23:44.7229979Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7230802Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7231512Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:44.7232406Z INVOCATION_ID=2d655c04c2b34aecaea14cccbfda1e33
2025-05-07T20:23:44.7232735Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:44.7232993Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:44.7233551Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7234303Z BUILD_ENV=build_binary
2025-05-07T20:23:44.7234527Z GITHUB_ACTOR=q10
2025-05-07T20:23:44.7234743Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:44.7234962Z KERN_NAME_LC=linux
2025-05-07T20:23:44.7235181Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:44.7235478Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:44.7235812Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:44.7236050Z USER=ec2-user
2025-05-07T20:23:44.7236283Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:44.7236569Z SHLVL=1
2025-05-07T20:23:44.7236767Z GITHUB_ACTOR_ID=255046
2025-05-07T20:23:44.7237067Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool
2025-05-07T20:23:44.7237500Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e
2025-05-07T20:23:44.7237853Z GITHUB_REF_NAME=4066/merge
2025-05-07T20:23:44.7238088Z KERN_NAME=Linux
2025-05-07T20:23:44.7238321Z GITHUB_JOB=test_and_publish_artifact
2025-05-07T20:23:44.7238726Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh
2025-05-07T20:23:44.7239139Z GITHUB_REPOSITORY=pytorch/FBGEMM
2025-05-07T20:23:44.7239413Z GITHUB_RETENTION_DAYS=90
2025-05-07T20:23:44.7239652Z JOURNAL_STREAM=8:85602
2025-05-07T20:23:44.7239953Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM
2025-05-07T20:23:44.7240314Z GITHUB_ACTION_REPOSITORY=
2025-05-07T20:23:44.7240621Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
2025-05-07T20:23:44.7240946Z GITHUB_BASE_REF=main
2025-05-07T20:23:44.7241162Z CI=true
2025-05-07T20:23:44.7241372Z GITHUB_REPOSITORY_OWNER=pytorch
2025-05-07T20:23:44.7241651Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6
2025-05-07T20:23:44.7241918Z GITHUB_ACTION_REF=
2025-05-07T20:23:44.7242163Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI
2025-05-07T20:23:44.7242745Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_51729451-ec85-4350-b577-611513ad2ac8
2025-05-07T20:23:44.7243297Z MACHINE_NAME=x86_64
2025-05-07T20:23:44.7243516Z _=/usr/bin/printenv
2025-05-07T20:23:44.7243647Z
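GPU_FLAG shows up in the environment above because the setup-nvidia script appended it to the file named by $GITHUB_ENV (also visible in the dump); anything written there as KEY=VALUE becomes an environment variable for subsequent steps. The mechanism, exactly as the script uses it:

  # Export a variable to all later steps of the job by appending
  # KEY=VALUE to the file that $GITHUB_ENV points at.
  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"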
2025-05-07T20:23:44.7243777Z ################################################################################
2025-05-07T20:23:44.7244086Z [INFO] Print ldd version ...
2025-05-07T20:23:44.7244332Z + ldd --version
2025-05-07T20:23:44.7244471Z
2025-05-07T20:23:44.7244572Z ldd (GNU libc) 2.34
2025-05-07T20:23:44.7244838Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:23:44.7245270Z This is free software; see the source for copying conditions.  There is NO
2025-05-07T20:23:44.7245791Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:23:44.7246228Z Written by Roland McGrath and Ulrich Drepper.
2025-05-07T20:23:44.7246441Z
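The glibc version printed by ldd is what bounds which manylinux wheels the runner can consume (PEP 600 ties the tag to the glibc version). A quick compatibility check — a sketch, with the required version here purely illustrative:

  # A manylinux_2_34 wheel needs glibc >= 2.34.
  getconf GNU_LIBC_VERSION     # => glibc 2.34
  ldd --version | head -n 1    # same information, as printed above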
2025-05-07T20:23:44.7247111Z + nproc 2025-05-07T20:23:44.7247221Z 2025-05-07T20:23:44.7265901Z 16 2025-05-07T20:23:44.7267584Z 2025-05-07T20:23:44.7267721Z + lscpu 2025-05-07T20:23:44.7399242Z 2025-05-07T20:23:44.7399417Z Architecture: x86_64 2025-05-07T20:23:44.7399810Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:44.7400291Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7400692Z Byte Order: Little Endian 2025-05-07T20:23:44.7400999Z CPU(s): 16 2025-05-07T20:23:44.7401296Z On-line CPU(s) list: 0-15 2025-05-07T20:23:44.7401612Z Vendor ID: AuthenticAMD 2025-05-07T20:23:44.7401940Z Model name: AMD EPYC 7R32 2025-05-07T20:23:44.7402254Z CPU family: 23 2025-05-07T20:23:44.7404416Z Model: 49 2025-05-07T20:23:44.7404713Z Thread(s) per core: 2 2025-05-07T20:23:44.7405001Z Core(s) per socket: 8 2025-05-07T20:23:44.7405288Z Socket(s): 1 2025-05-07T20:23:44.7405681Z Stepping: 0 2025-05-07T20:23:44.7405977Z BogoMIPS: 5599.99 2025-05-07T20:23:44.7408022Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7410041Z Hypervisor vendor: KVM 2025-05-07T20:23:44.7410353Z Virtualization type: full 2025-05-07T20:23:44.7410692Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:44.7411046Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:44.7411409Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:44.7411760Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:44.7412073Z NUMA node(s): 1 2025-05-07T20:23:44.7412364Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:44.7412698Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:44.7413061Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:44.7413549Z Vulnerability L1tf: Not affected 2025-05-07T20:23:44.7414046Z Vulnerability Mds: Not affected 2025-05-07T20:23:44.7414550Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:44.7415052Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:44.7415574Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:44.7416336Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:44.7417117Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:44.7417746Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:44.7418412Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:44.7419246Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:44.7419902Z Vulnerability Srbds: Not affected 2025-05-07T20:23:44.7420258Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:44.7420572Z 2025-05-07T20:23:44.7420663Z + cat /proc/cpuinfo 2025-05-07T20:23:44.7420796Z 2025-05-07T20:23:44.7420886Z processor : 0 2025-05-07T20:23:44.7421101Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7421371Z cpu family : 23 2025-05-07T20:23:44.7421599Z model : 49 
2025-05-07T20:23:44.7421807Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7422049Z stepping : 0 2025-05-07T20:23:44.7422262Z microcode : 0x830107f 2025-05-07T20:23:44.7422483Z cpu MHz : 2077.110 2025-05-07T20:23:44.7422698Z cache size : 512 KB 2025-05-07T20:23:44.7422914Z physical id : 0 2025-05-07T20:23:44.7423119Z siblings : 16 2025-05-07T20:23:44.7423319Z core id : 0 2025-05-07T20:23:44.7423523Z cpu cores : 8 2025-05-07T20:23:44.7423720Z apicid : 0 2025-05-07T20:23:44.7423922Z initial apicid : 0 2025-05-07T20:23:44.7424144Z fpu : yes 2025-05-07T20:23:44.7424338Z fpu_exception : yes 2025-05-07T20:23:44.7424556Z cpuid level : 13 2025-05-07T20:23:44.7424761Z wp : yes 2025-05-07T20:23:44.7426812Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7429128Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7429605Z bogomips : 5599.99 2025-05-07T20:23:44.7429828Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7430070Z clflush size : 64 2025-05-07T20:23:44.7430284Z cache_alignment : 64 2025-05-07T20:23:44.7430566Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7430887Z power management: 2025-05-07T20:23:44.7431018Z 2025-05-07T20:23:44.7431103Z processor : 1 2025-05-07T20:23:44.7431321Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7431561Z cpu family : 23 2025-05-07T20:23:44.7431781Z model : 49 2025-05-07T20:23:44.7432026Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7432270Z stepping : 0 2025-05-07T20:23:44.7432473Z microcode : 0x830107f 2025-05-07T20:23:44.7432704Z cpu MHz : 2921.142 2025-05-07T20:23:44.7432919Z cache size : 512 KB 2025-05-07T20:23:44.7433131Z physical id : 0 2025-05-07T20:23:44.7433331Z siblings : 16 2025-05-07T20:23:44.7433535Z core id : 1 2025-05-07T20:23:44.7433736Z cpu cores : 8 2025-05-07T20:23:44.7433929Z apicid : 2 2025-05-07T20:23:44.7434126Z initial apicid : 2 2025-05-07T20:23:44.7434341Z fpu : yes 2025-05-07T20:23:44.7434541Z fpu_exception : yes 2025-05-07T20:23:44.7434758Z cpuid level : 13 2025-05-07T20:23:44.7434967Z wp : yes 2025-05-07T20:23:44.7436892Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7439084Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7439571Z bogomips : 5599.99 2025-05-07T20:23:44.7439799Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7440031Z clflush size : 64 
2025-05-07T20:23:44.7440245Z cache_alignment : 64 2025-05-07T20:23:44.7440514Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7440825Z power management: 2025-05-07T20:23:44.7440962Z 2025-05-07T20:23:44.7441049Z processor : 2 2025-05-07T20:23:44.7441267Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7441507Z cpu family : 23 2025-05-07T20:23:44.7441706Z model : 49 2025-05-07T20:23:44.7441918Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7442195Z stepping : 0 2025-05-07T20:23:44.7442424Z microcode : 0x830107f 2025-05-07T20:23:44.7442660Z cpu MHz : 2915.953 2025-05-07T20:23:44.7442885Z cache size : 512 KB 2025-05-07T20:23:44.7443095Z physical id : 0 2025-05-07T20:23:44.7443311Z siblings : 16 2025-05-07T20:23:44.7443519Z core id : 2 2025-05-07T20:23:44.7443720Z cpu cores : 8 2025-05-07T20:23:44.7443927Z apicid : 4 2025-05-07T20:23:44.7444134Z initial apicid : 4 2025-05-07T20:23:44.7444343Z fpu : yes 2025-05-07T20:23:44.7444548Z fpu_exception : yes 2025-05-07T20:23:44.7444778Z cpuid level : 13 2025-05-07T20:23:44.7444979Z wp : yes 2025-05-07T20:23:44.7446981Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7449233Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7449718Z bogomips : 5599.99 2025-05-07T20:23:44.7449941Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7450174Z clflush size : 64 2025-05-07T20:23:44.7450398Z cache_alignment : 64 2025-05-07T20:23:44.7450672Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7450974Z power management: 2025-05-07T20:23:44.7451111Z 2025-05-07T20:23:44.7451197Z processor : 3 2025-05-07T20:23:44.7451437Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7451694Z cpu family : 23 2025-05-07T20:23:44.7451900Z model : 49 2025-05-07T20:23:44.7452110Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7452348Z stepping : 0 2025-05-07T20:23:44.7452554Z microcode : 0x830107f 2025-05-07T20:23:44.7452786Z cpu MHz : 3301.992 2025-05-07T20:23:44.7452994Z cache size : 512 KB 2025-05-07T20:23:44.7453310Z physical id : 0 2025-05-07T20:23:44.7453522Z siblings : 16 2025-05-07T20:23:44.7453716Z core id : 3 2025-05-07T20:23:44.7453914Z cpu cores : 8 2025-05-07T20:23:44.7454113Z apicid : 6 2025-05-07T20:23:44.7454309Z initial apicid : 6 2025-05-07T20:23:44.7454522Z fpu : yes 2025-05-07T20:23:44.7454727Z fpu_exception : yes 2025-05-07T20:23:44.7454948Z cpuid level : 13 2025-05-07T20:23:44.7455157Z wp : yes 2025-05-07T20:23:44.7457088Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7459708Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7460200Z bogomips : 5599.99 2025-05-07T20:23:44.7460416Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7460658Z clflush size : 64 2025-05-07T20:23:44.7460878Z cache_alignment : 64 2025-05-07T20:23:44.7461146Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7461459Z power management: 2025-05-07T20:23:44.7461590Z 2025-05-07T20:23:44.7461696Z processor : 4 2025-05-07T20:23:44.7461907Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7462148Z cpu family : 23 2025-05-07T20:23:44.7462362Z model : 49 2025-05-07T20:23:44.7509484Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7509807Z stepping : 0 2025-05-07T20:23:44.7510022Z microcode : 0x830107f 2025-05-07T20:23:44.7510303Z cpu MHz : 3038.535 2025-05-07T20:23:44.7510554Z cache size : 512 KB 2025-05-07T20:23:44.7510807Z physical id : 0 2025-05-07T20:23:44.7511053Z siblings : 16 2025-05-07T20:23:44.7511324Z core id : 4 2025-05-07T20:23:44.7511528Z cpu cores : 8 2025-05-07T20:23:44.7511732Z apicid : 8 2025-05-07T20:23:44.7511969Z initial apicid : 8 2025-05-07T20:23:44.7512189Z fpu : yes 2025-05-07T20:23:44.7512449Z fpu_exception : yes 2025-05-07T20:23:44.7512668Z cpuid level : 13 2025-05-07T20:23:44.7512875Z wp : yes 2025-05-07T20:23:44.7515095Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7517289Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7517884Z bogomips : 5599.99 2025-05-07T20:23:44.7518105Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7518347Z clflush size : 64 2025-05-07T20:23:44.7518552Z cache_alignment : 64 2025-05-07T20:23:44.7518822Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7519139Z power management: 2025-05-07T20:23:44.7519272Z 2025-05-07T20:23:44.7519353Z processor : 5 2025-05-07T20:23:44.7519570Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7519809Z cpu family : 23 2025-05-07T20:23:44.7520017Z model : 49 2025-05-07T20:23:44.7520223Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7520468Z stepping : 0 2025-05-07T20:23:44.7520668Z microcode : 0x830107f 2025-05-07T20:23:44.7520893Z cpu MHz : 3302.234 2025-05-07T20:23:44.7521100Z cache size : 512 KB 2025-05-07T20:23:44.7521309Z physical id : 0 2025-05-07T20:23:44.7521516Z siblings : 16 2025-05-07T20:23:44.7521713Z core id : 5 2025-05-07T20:23:44.7521945Z cpu cores : 8 2025-05-07T20:23:44.7522160Z apicid : 10 2025-05-07T20:23:44.7522356Z initial apicid : 10 2025-05-07T20:23:44.7522560Z fpu : yes 2025-05-07T20:23:44.7522759Z fpu_exception : yes 2025-05-07T20:23:44.7522977Z cpuid level : 13 2025-05-07T20:23:44.7523177Z wp : yes 2025-05-07T20:23:44.7525088Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7527255Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7527734Z bogomips : 5599.99 2025-05-07T20:23:44.7527948Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7528177Z clflush size : 64 2025-05-07T20:23:44.7528388Z cache_alignment : 64 2025-05-07T20:23:44.7528654Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7528958Z power management: 2025-05-07T20:23:44.7529094Z 2025-05-07T20:23:44.7529179Z processor : 6 2025-05-07T20:23:44.7529392Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7529622Z cpu family : 23 2025-05-07T20:23:44.7529825Z model : 49 2025-05-07T20:23:44.7530035Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7530276Z stepping : 0 2025-05-07T20:23:44.7530479Z microcode : 0x830107f 2025-05-07T20:23:44.7530702Z cpu MHz : 2144.336 2025-05-07T20:23:44.7530907Z cache size : 512 KB 2025-05-07T20:23:44.7531122Z physical id : 0 2025-05-07T20:23:44.7531331Z siblings : 16 2025-05-07T20:23:44.7531523Z core id : 6 2025-05-07T20:23:44.7531722Z cpu cores : 8 2025-05-07T20:23:44.7531919Z apicid : 12 2025-05-07T20:23:44.7532131Z initial apicid : 12 2025-05-07T20:23:44.7532336Z fpu : yes 2025-05-07T20:23:44.7532531Z fpu_exception : yes 2025-05-07T20:23:44.7532755Z cpuid level : 13 2025-05-07T20:23:44.7532957Z wp : yes 2025-05-07T20:23:44.7535003Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7537162Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7537641Z bogomips : 5599.99 2025-05-07T20:23:44.7537925Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7538162Z clflush size : 64 2025-05-07T20:23:44.7538376Z cache_alignment : 64 2025-05-07T20:23:44.7538637Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7538949Z power management: 2025-05-07T20:23:44.7539085Z 2025-05-07T20:23:44.7539213Z processor : 7 2025-05-07T20:23:44.7539501Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7539764Z cpu family : 23 2025-05-07T20:23:44.7539966Z model : 49 2025-05-07T20:23:44.7540168Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7540400Z stepping : 0 2025-05-07T20:23:44.7540601Z microcode : 0x830107f 2025-05-07T20:23:44.7540819Z cpu MHz : 2133.022 2025-05-07T20:23:44.7541030Z cache size : 512 KB 2025-05-07T20:23:44.7541240Z physical id : 0 2025-05-07T20:23:44.7541448Z siblings : 16 2025-05-07T20:23:44.7541643Z core id : 7 2025-05-07T20:23:44.7541869Z cpu cores : 8 2025-05-07T20:23:44.7542094Z apicid : 
14 2025-05-07T20:23:44.7542291Z initial apicid : 14 2025-05-07T20:23:44.7542502Z fpu : yes 2025-05-07T20:23:44.7542709Z fpu_exception : yes 2025-05-07T20:23:44.7542909Z cpuid level : 13 2025-05-07T20:23:44.7543111Z wp : yes 2025-05-07T20:23:44.7545012Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7547168Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7547633Z bogomips : 5599.99 2025-05-07T20:23:44.7547856Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7548094Z clflush size : 64 2025-05-07T20:23:44.7548307Z cache_alignment : 64 2025-05-07T20:23:44.7548567Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7548883Z power management: 2025-05-07T20:23:44.7549010Z 2025-05-07T20:23:44.7549094Z processor : 8 2025-05-07T20:23:44.7549301Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7549533Z cpu family : 23 2025-05-07T20:23:44.7549732Z model : 49 2025-05-07T20:23:44.7549930Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7550164Z stepping : 0 2025-05-07T20:23:44.7550369Z microcode : 0x830107f 2025-05-07T20:23:44.7550580Z cpu MHz : 3217.811 2025-05-07T20:23:44.7550779Z cache size : 512 KB 2025-05-07T20:23:44.7550996Z physical id : 0 2025-05-07T20:23:44.7551206Z siblings : 16 2025-05-07T20:23:44.7551414Z core id : 0 2025-05-07T20:23:44.7551616Z cpu cores : 8 2025-05-07T20:23:44.7551805Z apicid : 1 2025-05-07T20:23:44.7552032Z initial apicid : 1 2025-05-07T20:23:44.7552256Z fpu : yes 2025-05-07T20:23:44.7552451Z fpu_exception : yes 2025-05-07T20:23:44.7552667Z cpuid level : 13 2025-05-07T20:23:44.7552866Z wp : yes 2025-05-07T20:23:44.7554760Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7557040Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7557508Z bogomips : 5599.99 2025-05-07T20:23:44.7557720Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7557953Z clflush size : 64 2025-05-07T20:23:44.7558165Z cache_alignment : 64 2025-05-07T20:23:44.7558503Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7558807Z power management: 2025-05-07T20:23:44.7558939Z 2025-05-07T20:23:44.7559019Z processor : 9 2025-05-07T20:23:44.7560369Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7560618Z cpu family : 23 2025-05-07T20:23:44.7560820Z model : 49 2025-05-07T20:23:44.7561018Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7561255Z 
stepping : 0 2025-05-07T20:23:44.7561458Z microcode : 0x830107f 2025-05-07T20:23:44.7561676Z cpu MHz : 2209.047 2025-05-07T20:23:44.7561890Z cache size : 512 KB 2025-05-07T20:23:44.7562102Z physical id : 0 2025-05-07T20:23:44.7562300Z siblings : 16 2025-05-07T20:23:44.7562491Z core id : 1 2025-05-07T20:23:44.7562691Z cpu cores : 8 2025-05-07T20:23:44.7562886Z apicid : 3 2025-05-07T20:23:44.7563073Z initial apicid : 3 2025-05-07T20:23:44.7563282Z fpu : yes 2025-05-07T20:23:44.7563477Z fpu_exception : yes 2025-05-07T20:23:44.7563684Z cpuid level : 13 2025-05-07T20:23:44.7563886Z wp : yes 2025-05-07T20:23:44.7565784Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7567949Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7568415Z bogomips : 5599.99 2025-05-07T20:23:44.7568633Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7568862Z clflush size : 64 2025-05-07T20:23:44.7569067Z cache_alignment : 64 2025-05-07T20:23:44.7569325Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7569636Z power management: 2025-05-07T20:23:44.7569762Z 2025-05-07T20:23:44.7569848Z processor : 10 2025-05-07T20:23:44.7570056Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7570291Z cpu family : 23 2025-05-07T20:23:44.7570492Z model : 49 2025-05-07T20:23:44.7570689Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7570931Z stepping : 0 2025-05-07T20:23:44.7571130Z microcode : 0x830107f 2025-05-07T20:23:44.7571343Z cpu MHz : 2969.520 2025-05-07T20:23:44.7571553Z cache size : 512 KB 2025-05-07T20:23:44.7571783Z physical id : 0 2025-05-07T20:23:44.7572007Z siblings : 16 2025-05-07T20:23:44.7572201Z core id : 2 2025-05-07T20:23:44.7572394Z cpu cores : 8 2025-05-07T20:23:44.7572582Z apicid : 5 2025-05-07T20:23:44.7572789Z initial apicid : 5 2025-05-07T20:23:44.7572994Z fpu : yes 2025-05-07T20:23:44.7573245Z fpu_exception : yes 2025-05-07T20:23:44.7573454Z cpuid level : 13 2025-05-07T20:23:44.7573653Z wp : yes 2025-05-07T20:23:44.7575544Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7577703Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7578173Z bogomips : 5599.99 2025-05-07T20:23:44.7578570Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7578810Z clflush size : 64 2025-05-07T20:23:44.7579015Z cache_alignment : 64 2025-05-07T20:23:44.7579278Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:44.7579585Z power management: 2025-05-07T20:23:44.7579710Z 2025-05-07T20:23:44.7579987Z processor : 11 2025-05-07T20:23:44.7580202Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7580433Z cpu family : 23 2025-05-07T20:23:44.7580639Z model : 49 2025-05-07T20:23:44.7580841Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7581072Z stepping : 0 2025-05-07T20:23:44.7581279Z microcode : 0x830107f 2025-05-07T20:23:44.7581501Z cpu MHz : 3299.173 2025-05-07T20:23:44.7581702Z cache size : 512 KB 2025-05-07T20:23:44.7581913Z physical id : 0 2025-05-07T20:23:44.7582152Z siblings : 16 2025-05-07T20:23:44.7582361Z core id : 3 2025-05-07T20:23:44.7582554Z cpu cores : 8 2025-05-07T20:23:44.7582752Z apicid : 7 2025-05-07T20:23:44.7582939Z initial apicid : 7 2025-05-07T20:23:44.7583153Z fpu : yes 2025-05-07T20:23:44.7583345Z fpu_exception : yes 2025-05-07T20:23:44.7583555Z cpuid level : 13 2025-05-07T20:23:44.7583754Z wp : yes 2025-05-07T20:23:44.7585647Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7587805Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7588274Z bogomips : 5599.99 2025-05-07T20:23:44.7588479Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7588718Z clflush size : 64 2025-05-07T20:23:44.7588926Z cache_alignment : 64 2025-05-07T20:23:44.7589183Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7589491Z power management: 2025-05-07T20:23:44.7589617Z 2025-05-07T20:23:44.7589705Z processor : 12 2025-05-07T20:23:44.7589910Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7590148Z cpu family : 23 2025-05-07T20:23:44.7590346Z model : 49 2025-05-07T20:23:44.7590546Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7590775Z stepping : 0 2025-05-07T20:23:44.7590980Z microcode : 0x830107f 2025-05-07T20:23:44.7591205Z cpu MHz : 3138.651 2025-05-07T20:23:44.7591408Z cache size : 512 KB 2025-05-07T20:23:44.7591612Z physical id : 0 2025-05-07T20:23:44.7591817Z siblings : 16 2025-05-07T20:23:44.7592009Z core id : 4 2025-05-07T20:23:44.7592199Z cpu cores : 8 2025-05-07T20:23:44.7592390Z apicid : 9 2025-05-07T20:23:44.7592574Z initial apicid : 9 2025-05-07T20:23:44.7592776Z fpu : yes 2025-05-07T20:23:44.7592973Z fpu_exception : yes 2025-05-07T20:23:44.7593187Z cpuid level : 13 2025-05-07T20:23:44.7593392Z wp : yes 2025-05-07T20:23:44.7595292Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:44.7597445Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7597907Z bogomips : 5599.99 2025-05-07T20:23:44.7598118Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7598349Z clflush size : 64 2025-05-07T20:23:44.7598560Z cache_alignment : 64 2025-05-07T20:23:44.7598918Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7599227Z power management: 2025-05-07T20:23:44.7599353Z 2025-05-07T20:23:44.7599438Z processor : 13 2025-05-07T20:23:44.7599640Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7599868Z cpu family : 23 2025-05-07T20:23:44.7600152Z model : 49 2025-05-07T20:23:44.7600345Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7600582Z stepping : 0 2025-05-07T20:23:44.7600784Z microcode : 0x830107f 2025-05-07T20:23:44.7600996Z cpu MHz : 3259.052 2025-05-07T20:23:44.7601204Z cache size : 512 KB 2025-05-07T20:23:44.7601415Z physical id : 0 2025-05-07T20:23:44.7601612Z siblings : 16 2025-05-07T20:23:44.7601803Z core id : 5 2025-05-07T20:23:44.7602003Z cpu cores : 8 2025-05-07T20:23:44.7602194Z apicid : 11 2025-05-07T20:23:44.7602397Z initial apicid : 11 2025-05-07T20:23:44.7602605Z fpu : yes 2025-05-07T20:23:44.7602796Z fpu_exception : yes 2025-05-07T20:23:44.7603008Z cpuid level : 13 2025-05-07T20:23:44.7603211Z wp : yes 2025-05-07T20:23:44.7605125Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7607283Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7607748Z bogomips : 5599.99 2025-05-07T20:23:44.7607963Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7608192Z clflush size : 64 2025-05-07T20:23:44.7608398Z cache_alignment : 64 2025-05-07T20:23:44.7608663Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7608982Z power management: 2025-05-07T20:23:44.7609110Z 2025-05-07T20:23:44.7609190Z processor : 14 2025-05-07T20:23:44.7609399Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7609633Z cpu family : 23 2025-05-07T20:23:44.7609829Z model : 49 2025-05-07T20:23:44.7610030Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7610266Z stepping : 0 2025-05-07T20:23:44.7610462Z microcode : 0x830107f 2025-05-07T20:23:44.7610684Z cpu MHz : 2231.959 2025-05-07T20:23:44.7610890Z cache size : 512 KB 2025-05-07T20:23:44.7611095Z physical id : 0 2025-05-07T20:23:44.7611301Z siblings : 16 2025-05-07T20:23:44.7611499Z core id : 6 2025-05-07T20:23:44.7611690Z cpu cores : 8 2025-05-07T20:23:44.7611914Z apicid : 13 2025-05-07T20:23:44.7612137Z initial apicid : 13 2025-05-07T20:23:44.7612345Z fpu : yes 2025-05-07T20:23:44.7612537Z fpu_exception : yes 2025-05-07T20:23:44.7612748Z cpuid level : 13 2025-05-07T20:23:44.7612946Z wp : yes 2025-05-07T20:23:44.7614883Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7617038Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7617506Z bogomips : 5599.99 2025-05-07T20:23:44.7617723Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7617945Z clflush size : 64 2025-05-07T20:23:44.7618157Z cache_alignment : 64 2025-05-07T20:23:44.7618429Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7618728Z power management: 2025-05-07T20:23:44.7618860Z 2025-05-07T20:23:44.7619037Z processor : 15 2025-05-07T20:23:44.7619246Z vendor_id : AuthenticAMD 2025-05-07T20:23:44.7619474Z cpu family : 23 2025-05-07T20:23:44.7619671Z model : 49 2025-05-07T20:23:44.7619872Z model name : AMD EPYC 7R32 2025-05-07T20:23:44.7620109Z stepping : 0 2025-05-07T20:23:44.7620306Z microcode : 0x830107f 2025-05-07T20:23:44.7620606Z cpu MHz : 3056.931 2025-05-07T20:23:44.7620812Z cache size : 512 KB 2025-05-07T20:23:44.7621016Z physical id : 0 2025-05-07T20:23:44.7621215Z siblings : 16 2025-05-07T20:23:44.7621414Z core id : 7 2025-05-07T20:23:44.7621600Z cpu cores : 8 2025-05-07T20:23:44.7621793Z apicid : 15 2025-05-07T20:23:44.7622023Z initial apicid : 15 2025-05-07T20:23:44.7622247Z fpu : yes 2025-05-07T20:23:44.7622441Z fpu_exception : yes 2025-05-07T20:23:44.7622650Z cpuid level : 13 2025-05-07T20:23:44.7622845Z wp : yes 2025-05-07T20:23:44.7624745Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:44.7626908Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:44.7627372Z bogomips : 5599.99 2025-05-07T20:23:44.7627580Z TLB size : 3072 4K pages 2025-05-07T20:23:44.7627810Z clflush size : 64 2025-05-07T20:23:44.7628021Z cache_alignment : 64 2025-05-07T20:23:44.7628286Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:44.7628589Z power management: 2025-05-07T20:23:44.7628724Z 2025-05-07T20:23:44.7628728Z 2025-05-07T20:23:44.7628854Z ################################################################################ 2025-05-07T20:23:44.7629157Z [INFO] Print PCI info ... 2025-05-07T20:23:44.7629391Z + lspci -v 2025-05-07T20:23:44.7629506Z 2025-05-07T20:23:44.7629733Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:44.7630113Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:44.7630423Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:44.7630624Z 2025-05-07T20:23:44.7630815Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:44.7631183Z Physical Slot: 1 2025-05-07T20:23:44.7631420Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7631617Z 2025-05-07T20:23:44.7631858Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:44.7632273Z Physical Slot: 1 2025-05-07T20:23:44.7632522Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:44.7632742Z 2025-05-07T20:23:44.7633005Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:44.7633437Z Physical Slot: 3 2025-05-07T20:23:44.7633670Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7634002Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:44.7634351Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:44.7634565Z 2025-05-07T20:23:44.7634858Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:44.7635356Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:44.7635635Z Physical Slot: 4 2025-05-07T20:23:44.7635882Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:44.7636251Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:44.7636600Z Capabilities: 2025-05-07T20:23:44.7636878Z Kernel driver in use: nvme 2025-05-07T20:23:44.7637040Z 2025-05-07T20:23:44.7637343Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:44.7637813Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:44.7638157Z Physical Slot: 5 2025-05-07T20:23:44.7638387Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7638740Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:44.7639216Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:44.7639529Z Capabilities: 2025-05-07T20:23:44.7639788Z Kernel driver in use: ena 2025-05-07T20:23:44.7640025Z Kernel modules: ena 2025-05-07T20:23:44.7640161Z 2025-05-07T20:23:44.7640327Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:44.7640690Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:44.7640972Z Physical Slot: 30 2025-05-07T20:23:44.7641223Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:44.7641585Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:44.7642005Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:44.7642383Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:44.7642702Z Capabilities: 2025-05-07T20:23:44.7642964Z Kernel driver in use: nvidia 2025-05-07T20:23:44.7643208Z Kernel modules: nvidia 2025-05-07T20:23:44.7643357Z 2025-05-07T20:23:44.7643653Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:44.7644144Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:44.7644434Z Physical Slot: 31 2025-05-07T20:23:44.7644672Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:44.7645019Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:44.7645392Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:44.7645713Z Capabilities: 2025-05-07T20:23:44.7645965Z Kernel driver in use: nvme 2025-05-07T20:23:44.7646122Z 2025-05-07T20:23:44.7646126Z 2025-05-07T20:23:44.7646245Z ################################################################################ 2025-05-07T20:23:44.7646563Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:44.7646841Z + uname -a 2025-05-07T20:23:44.7646951Z 2025-05-07T20:23:44.7647345Z Linux ip-10-0-27-143.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:44.7647833Z 2025-05-07T20:23:44.7647915Z + uname -m 2025-05-07T20:23:44.7648028Z 2025-05-07T20:23:44.7648107Z x86_64 2025-05-07T20:23:44.7654417Z 2025-05-07T20:23:44.7654520Z + cat /proc/version 2025-05-07T20:23:44.7654670Z 2025-05-07T20:23:44.7655204Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:44.7655821Z 2025-05-07T20:23:44.7655909Z + cat /etc/os-release 2025-05-07T20:23:44.7656051Z 2025-05-07T20:23:44.7656149Z NAME="Amazon Linux" 2025-05-07T20:23:44.7656351Z VERSION="2023" 2025-05-07T20:23:44.7656549Z ID="amzn" 2025-05-07T20:23:44.7656734Z ID_LIKE="fedora" 2025-05-07T20:23:44.7656931Z VERSION_ID="2023" 2025-05-07T20:23:44.7657161Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:44.7657436Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:44.7657726Z ANSI_COLOR="0;33" 2025-05-07T20:23:44.7657962Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:44.7658348Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:44.7658774Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:44.7659174Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:44.7659932Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:44.7660305Z VENDOR_NAME="AWS" 2025-05-07T20:23:44.7660539Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:44.7660817Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:44.7660967Z 2025-05-07T20:23:44.7661269Z ################################################################################ 2025-05-07T20:23:44.7661573Z # Print EC2 Instance Info 2025-05-07T20:23:44.7661803Z # 2025-05-07T20:23:44.7662006Z # [2025-05-07T20:23:44.761Z] + print_ec2_info 2025-05-07T20:23:44.7662308Z ################################################################################ 2025-05-07T20:23:44.7662622Z 2025-05-07T20:23:44.7736146Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:44.7848301Z instance-id: i-050728826a2d12e7e 2025-05-07T20:23:44.7963641Z instance-type: g5.4xlarge 2025-05-07T20:23:44.8004872Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:44.8005230Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:44.8015189Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:44.8015547Z env: 2025-05-07T20:23:44.8015773Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:44.8016084Z BUILD_ENV: build_binary 2025-05-07T20:23:44.8016339Z BUILD_TARGET: genai 2025-05-07T20:23:44.8016571Z BUILD_VARIANT: cuda 2025-05-07T20:23:44.8016813Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:44.8017280Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:44.8017669Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:44.8018001Z ##[endgroup] 2025-05-07T20:23:45.1382029Z ################################################################################ 2025-05-07T20:23:45.1382482Z [INFO] Printing general display info ... 2025-05-07T20:23:45.1396928Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:45.2614753Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:45.2623072Z /usr/bin/sudo 2025-05-07T20:23:45.2634488Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:45.2644762Z /usr/bin/yum 2025-05-07T20:23:45.2645766Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:45.2667473Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:45.7472499Z Last metadata expiration check: 0:00:05 ago on Wed May 7 20:23:40 2025. 2025-05-07T20:23:45.8151820Z ================================================================================ 2025-05-07T20:23:45.8152533Z WARNING: 2025-05-07T20:23:45.8153183Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:45.8153821Z 2025-05-07T20:23:45.8154102Z Available Versions: 2025-05-07T20:23:45.8154541Z 2025-05-07T20:23:45.8154752Z Version 2023.7.20250331: 2025-05-07T20:23:45.8155360Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:45.8155849Z 2025-05-07T20:23:45.8156099Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:45.8156520Z 2025-05-07T20:23:45.8156689Z Release notes: 2025-05-07T20:23:45.8157480Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:45.8158197Z 2025-05-07T20:23:45.8158382Z Version 2023.7.20250414: 2025-05-07T20:23:45.8158970Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:45.8159904Z 2025-05-07T20:23:45.8160134Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:45.8160547Z 2025-05-07T20:23:45.8160724Z Release notes: 2025-05-07T20:23:45.8161480Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:45.8161990Z 2025-05-07T20:23:45.8162098Z Version 2023.7.20250428: 2025-05-07T20:23:45.8162416Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:45.8162658Z 2025-05-07T20:23:45.8162778Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:45.8162985Z 2025-05-07T20:23:45.8163069Z Release notes: 2025-05-07T20:23:45.8163452Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:45.8163810Z 2025-05-07T20:23:45.8163924Z ================================================================================ 2025-05-07T20:23:45.9300753Z Dependencies resolved. 
2025-05-07T20:23:45.9586922Z ================================================================================ 2025-05-07T20:23:45.9587518Z Package Arch Version Repository Size 2025-05-07T20:23:45.9588022Z ================================================================================ 2025-05-07T20:23:45.9588435Z Upgrading: 2025-05-07T20:23:45.9589015Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:45.9589598Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:45.9589978Z 2025-05-07T20:23:45.9590348Z Transaction Summary 2025-05-07T20:23:45.9590701Z ================================================================================ 2025-05-07T20:23:45.9591114Z Upgrade 2 Packages 2025-05-07T20:23:45.9591277Z 2025-05-07T20:23:45.9591416Z Total download size: 6.9 M 2025-05-07T20:23:45.9591770Z Downloading Packages: 2025-05-07T20:23:46.0240896Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 19 MB/s | 1.2 MB 00:00 2025-05-07T20:23:46.0497077Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 63 MB/s | 5.7 MB 00:00 2025-05-07T20:23:46.0505963Z -------------------------------------------------------------------------------- 2025-05-07T20:23:46.0509178Z Total 76 MB/s | 6.9 MB 00:00 2025-05-07T20:23:46.0511759Z Running transaction check 2025-05-07T20:23:46.0611352Z Transaction check succeeded. 2025-05-07T20:23:46.0612606Z Running transaction test 2025-05-07T20:23:46.0908231Z Transaction test succeeded. 2025-05-07T20:23:46.0910981Z Running transaction 2025-05-07T20:23:46.6435460Z Preparing : 1/1 2025-05-07T20:23:46.7495370Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:46.7516598Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:46.7740714Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:46.7741466Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:46.7851068Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:46.7876992Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:46.9553518Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:46.9554322Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:46.9554965Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:46.9555496Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4
2025-05-07T20:23:47.1533094Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:47.1533658Z 2025-05-07T20:23:47.1533782Z Upgraded: 2025-05-07T20:23:47.1534179Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:47.1534747Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:47.1535082Z 2025-05-07T20:23:47.1535178Z Complete! 2025-05-07T20:23:47.1983858Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:47.2007898Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:47.6572650Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:40 2025. 2025-05-07T20:23:47.6811822Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:47.7216035Z Dependencies resolved.
2025-05-07T20:23:47.7395824Z ================================================================================ 2025-05-07T20:23:47.7396932Z Package Architecture Version Repository Size 2025-05-07T20:23:47.7397762Z ================================================================================ 2025-05-07T20:23:47.7398552Z Installing: 2025-05-07T20:23:47.7399165Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:47.7399819Z 2025-05-07T20:23:47.7400004Z Transaction Summary 2025-05-07T20:23:47.7400504Z ================================================================================ 2025-05-07T20:23:47.7401087Z Install 1 Package 2025-05-07T20:23:47.7401362Z 2025-05-07T20:23:47.7401560Z Total download size: 319 k 2025-05-07T20:23:47.7402068Z Installed size: 837 k 2025-05-07T20:23:47.7402432Z Downloading Packages: 2025-05-07T20:23:47.8124013Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 7.1 MB/s | 319 kB 00:00 2025-05-07T20:23:47.8129431Z -------------------------------------------------------------------------------- 2025-05-07T20:23:47.8132153Z Total 4.3 MB/s | 319 kB 00:00 2025-05-07T20:23:47.8288458Z Running transaction check 2025-05-07T20:23:47.8342552Z Transaction check succeeded. 2025-05-07T20:23:47.8342851Z Running transaction test 2025-05-07T20:23:47.8803965Z Transaction test succeeded. 2025-05-07T20:23:47.8807695Z Running transaction 2025-05-07T20:23:47.9827969Z Preparing : 1/1 2025-05-07T20:23:48.0323582Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:48.1963829Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1
2025-05-07T20:23:48.3622197Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:48.3622531Z 2025-05-07T20:23:48.3622629Z Installed: 2025-05-07T20:23:48.3622944Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:48.3623259Z 2025-05-07T20:23:48.3623345Z Complete! 2025-05-07T20:23:48.4070239Z + hostname 2025-05-07T20:23:48.4070396Z 2025-05-07T20:23:48.4084312Z ip-10-0-27-143.ec2.internal 2025-05-07T20:23:48.4085791Z 2025-05-07T20:23:48.4085997Z + sudo lshw -C display 2025-05-07T20:23:48.4086162Z 2025-05-07T20:23:48.8891098Z *-display:0 UNCLAIMED 2025-05-07T20:23:48.8891792Z description: VGA compatible controller 2025-05-07T20:23:48.8892449Z product: Amazon.com, Inc. 2025-05-07T20:23:48.8892914Z vendor: Amazon.com, Inc.
2025-05-07T20:23:48.8893276Z physical id: 3 2025-05-07T20:23:48.8893521Z bus info: pci@0000:00:03.0 2025-05-07T20:23:48.8893787Z version: 00 2025-05-07T20:23:48.8894001Z width: 32 bits 2025-05-07T20:23:48.8894235Z clock: 33MHz 2025-05-07T20:23:48.8894502Z capabilities: vga_controller bus_master 2025-05-07T20:23:48.8894827Z configuration: latency=0 2025-05-07T20:23:48.8895182Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:48.8895525Z *-display:1 2025-05-07T20:23:48.8895758Z description: 3D controller 2025-05-07T20:23:48.8896053Z product: GA102GL [A10G] 2025-05-07T20:23:48.8896326Z vendor: NVIDIA Corporation 2025-05-07T20:23:48.8896604Z physical id: 1e 2025-05-07T20:23:48.8896843Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:48.8897107Z version: a1 2025-05-07T20:23:48.8897331Z width: 64 bits 2025-05-07T20:23:48.8897549Z clock: 33MHz 2025-05-07T20:23:48.8897850Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:48.8898227Z configuration: driver=nvidia latency=0 2025-05-07T20:23:48.8898834Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:48.8929063Z 2025-05-07T20:23:48.8929565Z ################################################################################ 2025-05-07T20:23:48.8929908Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:48.9058015Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:48.9243693Z Wed May 7 20:23:48 2025 2025-05-07T20:23:48.9244065Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:48.9244566Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:48.9245048Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:48.9245533Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:48.9246038Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:48.9246458Z | | | MIG M. | 2025-05-07T20:23:48.9246787Z |=========================================+========================+======================| 2025-05-07T20:23:48.9379335Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:48.9379966Z | 0% 26C P8 9W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:48.9380336Z | | | N/A | 2025-05-07T20:23:48.9380723Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:48.9384176Z 2025-05-07T20:23:48.9384575Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:48.9384992Z | Processes: | 2025-05-07T20:23:48.9385420Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:48.9385822Z | ID ID Usage | 2025-05-07T20:23:48.9386169Z |=========================================================================================| 2025-05-07T20:23:48.9389464Z | No running processes found | 2025-05-07T20:23:48.9390075Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:49.1815908Z ################################################################################ 2025-05-07T20:23:49.1816280Z [INFO] Printing AMD GPU info ... 
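[NOTE] The AMD GPU probe below first checks whether the ROCm CLI tools are on the PATH and only reports "[CHECK] ... not found" when they are absent. A minimal sketch of that probe, under the assumption that it is a simple which-based check (the function name is illustrative, not taken from this log):

# Hypothetical ROCm probe matching the "[CHECK] rocminfo not found" output below.
print_rocm_info () {
  local tool
  for tool in rocminfo rocm-smi; do
    if which "$tool"; then
      "$tool"                          # tool present: dump its report
    else
      echo "[CHECK] $tool not found"   # tool absent: note it and continue
    fi
  done
}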
2025-05-07T20:23:49.1815908Z ################################################################################
2025-05-07T20:23:49.1816280Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:49.1959413Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:49.1960210Z [CHECK] rocminfo not found
2025-05-07T20:23:49.1969760Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:49.1971209Z [CHECK] rocm-smi not found
2025-05-07T20:23:49.2014748Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:49.2015182Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:49.2027443Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:49.2027790Z env:
2025-05-07T20:23:49.2028009Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:49.2028308Z   BUILD_ENV: build_binary
2025-05-07T20:23:49.2028556Z   BUILD_TARGET: genai
2025-05-07T20:23:49.2028775Z   BUILD_VARIANT: cuda
2025-05-07T20:23:49.2029012Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:49.2029274Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:49.2029574Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:49.2029905Z ##[endgroup]
2025-05-07T20:23:49.5368837Z ################################################################################
2025-05-07T20:23:49.5369213Z # Setup Miniconda
2025-05-07T20:23:49.5369421Z #
2025-05-07T20:23:49.5383924Z # [2025-05-07T20:23:49.538Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:49.5384340Z ################################################################################
2025-05-07T20:23:49.5398802Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:49.6334969Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:49.6335340Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:49.6353976Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:49.6379889Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:50.6243722Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:50.6244110Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:50.6390059Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:51.0898913Z Unpacking payload ...
2025-05-07T20:23:51.6080602Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:52.4141639Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:54.5148615Z Installing base environment...
2025-05-07T20:23:55.6012313Z Preparing transaction: ...working... done
2025-05-07T20:23:58.5303968Z Executing transaction: ...working... done
2025-05-07T20:23:59.1918828Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:23:59.2792517Z installation finished.
2025-05-07T20:23:59.2801608Z + rm -f miniconda.sh
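The DeprecationWarning repeated during unpacking comes from Python's tarfile module: from Python 3.14 onward, extraction applies a filter by default. Code that extracts archives can opt in explicitly today (Python 3.12+); a minimal sketch, with purely illustrative path names:

    python3 - <<'EOF'
    import tarfile
    # The 'data' filter (PEP 706) strips setuid/setgid bits and rejects absolute
    # paths or links that escape the destination directory.
    with tarfile.open("payload.tar.gz") as tf:        # hypothetical archive
        tf.extractall(path="dest", filter="data")     # silences the warning
    EOF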
2025-05-07T20:23:59.3116592Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:59.3116962Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:59.6792088Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:59.6792638Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:59.6793107Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:59.6793575Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:59.6793994Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:59.6794379Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:59.6794805Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:59.6795240Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:59.6795690Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:59.6796476Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:59.6797009Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:59.6797372Z modified      /home/ec2-user/.bashrc
2025-05-07T20:23:59.6797753Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:23:59.7435047Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:00.5768085Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:00.5792476Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:13.9231589Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:15.4947767Z Solving environment: done
2025-05-07T20:24:15.5914739Z ## Package Plan ##
2025-05-07T20:24:15.5915515Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:15.5916327Z   added / updated specs:
2025-05-07T20:24:15.5916929Z     - conda-libmamba-solver
2025-05-07T20:24:15.5917438Z     - libarchive
2025-05-07T20:24:15.5917814Z     - libmamba
2025-05-07T20:24:15.5918173Z     - libmambapy
2025-05-07T20:24:15.5918655Z The following packages will be downloaded:
2025-05-07T20:24:15.5919633Z     package                     |            build
2025-05-07T20:24:15.5920191Z     ----------------------------|-----------------
2025-05-07T20:24:15.5920923Z     ca-certificates-2025.4.26   |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:24:15.5921767Z     certifi-2025.4.26           |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:24:15.5922533Z     conda-25.3.1                |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:24:15.5923366Z     conda-libmamba-solver-25.4.0|     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:24:15.5924127Z     ------------------------------------------------------------
2025-05-07T20:24:15.5924477Z                                            Total:         1.4 MB
2025-05-07T20:24:15.5924797Z The following packages will be UPDATED:
2025-05-07T20:24:15.5928662Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:15.5929454Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:15.5930049Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:15.5930736Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:15.5931508Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:15.5932133Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:15.6506016Z ca-certificates-2025 | 149 KB    | ########## | 100%
2025-05-07T20:24:15.6733593Z conda-libmamba-solve | 41 KB     | ########## | 100%
2025-05-07T20:24:15.6768348Z conda-25.3.1         | 1.1 MB    | ########## | 100%
2025-05-07T20:24:15.6906859Z certifi-2025.4.26    | 154 KB    | ########## | 100%
2025-05-07T20:24:15.7925779Z done
2025-05-07T20:24:15.8928341Z Preparing transaction: done
2025-05-07T20:24:15.9934503Z Verifying transaction: done
2025-05-07T20:24:17.2954321Z Executing transaction: done
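The packages installed above provide conda's libmamba solver; on conda releases where it is not already the default, it can be enabled with a single config setting. A sketch (the conda info output further below confirms this installation already defaults to it):

    # Make libmamba the default dependency solver for all conda operations
    conda config --set solver libmamba
    # Or opt in for a single command:
    conda install --solver=libmamba -y numpy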
2025-05-07T20:24:19.0078145Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:19.0103103Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:19.9478958Z Channels:
2025-05-07T20:24:19.9479205Z  - defaults
2025-05-07T20:24:19.9479427Z Platform: linux-64
2025-05-07T20:24:21.1477784Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.2691233Z Solving environment: done
2025-05-07T20:24:21.2691545Z Channels:
2025-05-07T20:24:21.2691769Z  - defaults
2025-05-07T20:24:21.2691769Z Platform: linux-64
2025-05-07T20:24:21.5605381Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:21.7711340Z Solving environment: done
2025-05-07T20:24:21.9237149Z ## Package Plan ##
2025-05-07T20:24:21.9237450Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:21.9237779Z   added / updated specs:
2025-05-07T20:24:21.9238021Z     - conda
2025-05-07T20:24:21.9238276Z The following packages will be downloaded:
2025-05-07T20:24:21.9238596Z     package                    |            build
2025-05-07T20:24:21.9238914Z     ---------------------------|-----------------
2025-05-07T20:24:21.9239255Z     pip-25.1                   |     pyhc872135_2         1.3 MB
2025-05-07T20:24:21.9239631Z     tzdata-2025b               |       h04d1e81_0         116 KB
2025-05-07T20:24:21.9239990Z     ------------------------------------------------------------
2025-05-07T20:24:21.9240324Z                                            Total:         1.4 MB
2025-05-07T20:24:21.9240860Z The following packages will be UPDATED:
2025-05-07T20:24:21.9241376Z   pip       pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:21.9241870Z   tzdata                      2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:21.9242266Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:22.0556327Z tzdata-2025b         | 116 KB    | ########## | 100%
2025-05-07T20:24:22.1094517Z pip-25.1             | 1.3 MB    | ########## | 100%
2025-05-07T20:24:22.2126045Z done
2025-05-07T20:24:22.3128692Z Preparing transaction: done
2025-05-07T20:24:22.4134343Z Verifying transaction: done
2025-05-07T20:24:24.4164362Z Executing transaction: done
2025-05-07T20:24:25.0339600Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:25.0344348Z + conda clean --packages --tarball -y
2025-05-07T20:24:26.0432811Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:26.0433366Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:26.1096543Z + conda clean --all -y
2025-05-07T20:24:26.6507852Z There are no unused tarball(s) to remove.
2025-05-07T20:24:26.6508208Z Will remove 1 index cache(s).
2025-05-07T20:24:26.6508491Z There are no unused package(s) to remove.
2025-05-07T20:24:26.6508801Z There are no tempfile(s) to remove.
2025-05-07T20:24:26.6509092Z There are no logfile(s) to remove.
2025-05-07T20:24:26.7169889Z + conda info
2025-05-07T20:24:27.4908947Z      active environment : base
2025-05-07T20:24:27.4909402Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:24:27.4909721Z             shell level : 1
2025-05-07T20:24:27.4909997Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:24:27.4910378Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:27.4910772Z           conda version : 25.3.1
2025-05-07T20:24:27.4911050Z     conda-build version : not installed
2025-05-07T20:24:27.4911346Z          python version : 3.13.2.final.0
2025-05-07T20:24:27.4911638Z                  solver : libmamba (default)
2025-05-07T20:24:27.4911941Z        virtual packages : __archspec=1=zen2
2025-05-07T20:24:27.4912234Z                           __conda=25.3.1=0
2025-05-07T20:24:27.4912507Z                           __cuda=12.8=0
2025-05-07T20:24:27.4912774Z                           __glibc=2.34=0
2025-05-07T20:24:27.4913048Z                           __linux=6.1.130=0
2025-05-07T20:24:27.4913325Z                           __unix=0=0
2025-05-07T20:24:27.4913651Z        base environment : /home/ec2-user/miniconda  (writable)
2025-05-07T20:24:27.4914052Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:27.4914398Z   conda av metadata url : None
2025-05-07T20:24:27.4915070Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:27.4915509Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:27.4915890Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:27.4916259Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:27.4916620Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:27.4916958Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:24:27.4917294Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:27.4917620Z                           /home/ec2-user/.conda/envs
2025-05-07T20:24:27.4917952Z                platform : linux-64
2025-05-07T20:24:27.4918791Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:27.4919607Z                 UID:GID : 1000:1000
2025-05-07T20:24:27.4919879Z              netrc file : None
2025-05-07T20:24:27.4920138Z            offline mode : False
2025-05-07T20:24:27.5569790Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:27.5570825Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_9383b14d-7f66-434c-93e5-e2304d3bdbb6 ...
2025-05-07T20:24:27.5571911Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
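Condensed, the bootstrap that setup_miniconda performed above reduces to four commands; a minimal sketch, assuming the same prefix as this runner:

    # Non-interactive Miniconda install: -b batch mode, -p prefix, -u update in place
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "${HOME}/miniconda" -u
    rm -f miniconda.sh
    "${HOME}/miniconda/bin/conda" init bash && source "${HOME}/.bashrc"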
2025-05-07T20:24:27.5691697Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:27.5692190Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:27.5710436Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:27.5710796Z env:
2025-05-07T20:24:27.5711023Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:27.5711320Z   BUILD_ENV: build_binary
2025-05-07T20:24:27.5711571Z   BUILD_TARGET: genai
2025-05-07T20:24:27.5711795Z   BUILD_VARIANT: cuda
2025-05-07T20:24:27.5712204Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:27.5712460Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:27.5712762Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:27.5713093Z ##[endgroup]
2025-05-07T20:24:27.9091783Z ################################################################################
2025-05-07T20:24:27.9092297Z # Create Conda Environment
2025-05-07T20:24:27.9092575Z #
2025-05-07T20:24:27.9107308Z # [2025-05-07T20:24:27.910Z] + create_conda_environment build_binary 3.12
2025-05-07T20:24:27.9107876Z ################################################################################
2025-05-07T20:24:27.9122926Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:28.0040539Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:28.0041044Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:28.0041470Z + conda info --envs
2025-05-07T20:24:28.7602089Z # conda environments:
2025-05-07T20:24:28.7602371Z #
2025-05-07T20:24:28.7602605Z base                  /home/ec2-user/miniconda
2025-05-07T20:24:28.8257971Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:30.4592128Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:30.4617115Z [SETUP] Creating new Conda environment (Python 3.12) ...
2025-05-07T20:24:30.4638697Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:24:31.2239724Z Channels:
2025-05-07T20:24:31.2240050Z  - defaults
2025-05-07T20:24:31.2240333Z Platform: linux-64
2025-05-07T20:24:32.7499577Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:32.8505047Z Solving environment: done
2025-05-07T20:24:32.8792485Z ## Package Plan ##
2025-05-07T20:24:32.8793033Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:32.8793570Z   added / updated specs:
2025-05-07T20:24:32.8793896Z     - python=3.12
2025-05-07T20:24:32.8794162Z The following packages will be downloaded:
2025-05-07T20:24:32.8794504Z     package                    |            build
2025-05-07T20:24:32.8794826Z     ---------------------------|-----------------
2025-05-07T20:24:32.8795188Z     _libgcc_mutex-0.1          |             main           3 KB
2025-05-07T20:24:32.8795583Z     _openmp_mutex-5.1          |            1_gnu          21 KB
2025-05-07T20:24:32.8795993Z     ca-certificates-2025.2.25  |       h06a4308_0         129 KB
2025-05-07T20:24:32.8796500Z     python-3.12.9              |       h5148396_0        34.7 MB
2025-05-07T20:24:32.8796965Z     setuptools-78.1.1          |  py312h06a4308_0         2.2 MB
2025-05-07T20:24:32.8797356Z     wheel-0.45.1               |  py312h06a4308_0         147 KB
2025-05-07T20:24:32.8797720Z     ------------------------------------------------------------
2025-05-07T20:24:32.8798075Z                                            Total:        37.2 MB
2025-05-07T20:24:32.8798413Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:32.8799199Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:32.8799654Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:32.8800073Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:32.8800601Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:32.8801081Z   expat              pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:32.8801528Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:32.8802130Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:32.8802553Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:32.8802979Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:32.8803433Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:32.8804032Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:32.8804603Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:32.8805047Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:32.8805461Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:32.8805868Z   python             pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:24:32.8806292Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:32.8806767Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:24:32.8807245Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:32.8807635Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:32.8808019Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:32.8808441Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:24:32.8808845Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:32.8809221Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
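The same environment could also be captured declaratively rather than via the imperative conda create shown above; a sketch using a hypothetical environment.yml of our own naming:

    # Write a spec file equivalent to 'conda create -n build_binary python=3.12'
    cat > environment.yml <<'EOF'
    name: build_binary
    channels:
      - defaults
    dependencies:
      - python=3.12
    EOF
    conda env create -f environment.yml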
2025-05-07T20:24:32.8809619Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:32.9306915Z _openmp_mutex-5.1    | 21 KB     | ########## | 100%
2025-05-07T20:24:32.9513844Z wheel-0.45.1         | 147 KB    | ########## | 100%
2025-05-07T20:24:32.9797968Z _libgcc_mutex-0.1    | 3 KB      | ########## | 100%
2025-05-07T20:24:32.9845600Z ca-certificates-2025 | 129 KB    | ########## | 100%
2025-05-07T20:24:33.0800965Z setuptools-78.1.1    | 2.2 MB    | ########## | 100%
2025-05-07T20:24:33.3679983Z python-3.12.9        | 34.7 MB   | ########## | 100%
2025-05-07T20:24:34.0224847Z done
2025-05-07T20:24:34.2331618Z Preparing transaction: done
2025-05-07T20:24:35.6495024Z Verifying transaction: done
2025-05-07T20:24:37.9723929Z Executing transaction: done
2025-05-07T20:24:38.0230226Z #
2025-05-07T20:24:38.0230502Z # To activate this environment, use
2025-05-07T20:24:38.0231025Z #
2025-05-07T20:24:38.0231266Z #     $ conda activate build_binary
2025-05-07T20:24:38.0231538Z #
2025-05-07T20:24:38.0231760Z # To deactivate an active environment, use
2025-05-07T20:24:38.0232075Z #
2025-05-07T20:24:38.0232287Z #     $ conda deactivate
2025-05-07T20:24:38.1311278Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:38.1333322Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:41.1392035Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:41.1392935Z Collecting pip
2025-05-07T20:24:41.1393404Z   Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:41.1394010Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:41.1397352Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 53.7 MB/s eta 0:00:00
2025-05-07T20:24:41.1398395Z Installing collected packages: pip
2025-05-07T20:24:41.1398835Z   Attempting uninstall: pip
2025-05-07T20:24:41.1399257Z     Found existing installation: pip 25.1
2025-05-07T20:24:41.1399682Z     Uninstalling pip-25.1:
2025-05-07T20:24:41.1400080Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:41.1400538Z Successfully installed pip-25.1.1
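Note the idiom used for the pip upgrade above: conda run -n <env> executes a command inside an environment without activating it in the calling shell, which suits non-login CI shells. A minimal sketch of the same pattern:

    # Run env-scoped commands without 'conda activate'
    conda run -n build_binary pip install --upgrade pip
    conda run -n build_binary python -c "import sys; print(sys.version)"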
2025-05-07T20:24:41.2046469Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:41.2069319Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:42.0608434Z Channels:
2025-05-07T20:24:42.0608674Z  - conda-forge
2025-05-07T20:24:42.0608896Z Platform: linux-64
2025-05-07T20:24:52.4420500Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:54.1494446Z Solving environment: done
2025-05-07T20:24:54.2128158Z ## Package Plan ##
2025-05-07T20:24:54.2128619Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:54.2129033Z   added / updated specs:
2025-05-07T20:24:54.2129306Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:54.2129632Z The following packages will be downloaded:
2025-05-07T20:24:54.2129995Z     package                    |            build
2025-05-07T20:24:54.2130358Z     ---------------------------|-----------------
2025-05-07T20:24:54.2130799Z     cffi-1.17.1                |  py312h06ac9bb_0         288 KB  conda-forge
2025-05-07T20:24:54.2131246Z     cryptography-44.0.3        |  py312hda17c39_0         1.5 MB  conda-forge
2025-05-07T20:24:54.2131681Z     expat-2.7.0                |       h5888daf_0         137 KB  conda-forge
2025-05-07T20:24:54.2132087Z     libexpat-2.7.0             |       h5888daf_0          73 KB  conda-forge
2025-05-07T20:24:54.2132505Z     libgcc-15.1.0              |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:54.2132920Z     libgcc-ng-15.1.0           |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:54.2133447Z     libgomp-15.1.0             |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:54.2133852Z     libnsl-2.0.1               |       hd590300_0          33 KB  conda-forge
2025-05-07T20:24:54.2134271Z     libsqlite-3.46.0           |       hde9e2c9_0         845 KB  conda-forge
2025-05-07T20:24:54.2134700Z     libuuid-2.38.1             |       h0b41bf4_0          33 KB  conda-forge
2025-05-07T20:24:54.2135108Z     libxcrypt-4.4.36           |       hd590300_1          98 KB  conda-forge
2025-05-07T20:24:54.2135666Z     libzlib-1.2.13             |       h4ab18f5_6          60 KB  conda-forge
2025-05-07T20:24:54.2136080Z     openssl-3.5.0              |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:54.2136497Z     pycparser-2.22             |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:54.2136930Z     pyopenssl-25.0.0           |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:54.2137368Z     python-3.12.2              |hab00c5b_0_cpython        30.8 MB  conda-forge
2025-05-07T20:24:54.2137790Z     python_abi-3.12            |          7_cp312           7 KB  conda-forge
2025-05-07T20:24:54.2138244Z     typing-extensions-4.13.2   |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:54.2139113Z     typing_extensions-4.13.2   |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:54.2139552Z     zlib-1.2.13                |       h4ab18f5_6          91 KB  conda-forge
2025-05-07T20:24:54.2139933Z     ------------------------------------------------------------
2025-05-07T20:24:54.2140270Z                                            Total:        38.6 MB
2025-05-07T20:24:54.2140618Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:54.2141323Z   cffi               conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:24:54.2141915Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:24:54.2142411Z   libexpat           conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:24:54.2142843Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:54.2143266Z   libnsl             conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:24:54.2145863Z   libsqlite          conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:24:54.2146369Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:24:54.2146919Z   libzlib            conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:54.2147366Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:54.2147831Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:54.2148282Z   python_abi         conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:24:54.2148786Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:54.2149360Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:54.2149805Z The following packages will be UPDATED:
2025-05-07T20:24:54.2150501Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:54.2151486Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:54.2152234Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:54.2152958Z   libuuid            pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:24:54.2153676Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:54.2154363Z   zlib               pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:24:54.2154995Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:54.2155630Z   expat              pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:24:54.2156338Z   python             pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:24:54.2156951Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:54.3157735Z openssl-3.5.0        | 3.0 MB    | ########## | 100%
2025-05-07T20:24:54.3250474Z libgcc-15.1.0        | 810 KB    | ########## | 100%
2025-05-07T20:24:54.3529367Z cryptography-44.0.3  | 1.5 MB    | ########## | 100%
2025-05-07T20:24:54.3806873Z libgomp-15.1.0       | 442 KB    | ########## | 100%
2025-05-07T20:24:54.3811996Z libsqlite-3.46.0     | 845 KB    | ########## | 100%
2025-05-07T20:24:54.4110465Z cffi-1.17.1          | 288 KB    | ########## | 100%
2025-05-07T20:24:54.4203883Z expat-2.7.0          | 137 KB    | ########## | 100%
2025-05-07T20:24:54.4355262Z pyopenssl-25.0.0     | 120 KB    | ########## | 100%
2025-05-07T20:24:54.4487729Z pycparser-2.22       | 108 KB    | ########## | 100%
2025-05-07T20:24:54.4609405Z libxcrypt-4.4.36     | 98 KB     | ########## | 100%
2025-05-07T20:24:54.4708706Z typing-extensions-4. | 88 KB     | ########## | 100%
2025-05-07T20:24:54.4736721Z zlib-1.2.13          | 91 KB     | ########## | 100%
2025-05-07T20:24:54.5058699Z typing_extensions-4. | 51 KB     | ########## | 100%
2025-05-07T20:24:54.5165285Z libzlib-1.2.13       | 60 KB     | ########## | 100%
2025-05-07T20:24:54.5289172Z libgcc-ng-15.1.0     | 34 KB     | ########## | 100%
2025-05-07T20:24:54.5390479Z libexpat-2.7.0       | 73 KB     | ########## | 100%
2025-05-07T20:24:54.5667267Z libuuid-2.38.1       | 33 KB     | ########## | 100%
2025-05-07T20:24:54.5739271Z libnsl-2.0.1         | 33 KB     | ########## | 100%
2025-05-07T20:24:54.5788476Z ... (more hidden) ...
2025-05-07T20:24:55.0864370Z python-3.12.2        | 30.8 MB   | ########6  |  86%
2025-05-07T20:24:55.1426859Z 2025-05-07T20:24:55.1426864Z 2025-05-07T20:24:55.1426869Z 2025-05-07T20:24:55.1426875Z 2025-05-07T20:24:55.1427023Z 2025-05-07T20:24:55.1649689Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:24:55.1650096Z 2025-05-07T20:24:55.1650101Z 2025-05-07T20:24:55.1650107Z 2025-05-07T20:24:55.1650112Z 2025-05-07T20:24:55.1650132Z 2025-05-07T20:24:55.1650138Z 2025-05-07T20:24:55.1650143Z 2025-05-07T20:24:55.1650148Z 2025-05-07T20:24:55.1650153Z 2025-05-07T20:24:55.1650159Z 2025-05-07T20:24:55.1650164Z 2025-05-07T20:24:55.1650181Z 2025-05-07T20:24:55.1650186Z 2025-05-07T20:24:55.1650191Z 2025-05-07T20:24:55.1650195Z 2025-05-07T20:24:55.1650201Z 2025-05-07T20:24:55.1650206Z 2025-05-07T20:24:55.1650211Z 2025-05-07T20:24:55.1663339Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:24:55.1663769Z 2025-05-07T20:24:55.1663774Z 2025-05-07T20:24:55.1663780Z 2025-05-07T20:24:55.1663785Z 2025-05-07T20:24:55.1663790Z 2025-05-07T20:24:55.1663795Z 2025-05-07T20:24:55.1663800Z 2025-05-07T20:24:55.1663805Z 2025-05-07T20:24:55.1663810Z 2025-05-07T20:24:55.1663815Z 2025-05-07T20:24:55.1663820Z 2025-05-07T20:24:55.1663825Z 2025-05-07T20:24:55.1663830Z 2025-05-07T20:24:55.1663835Z 2025-05-07T20:24:55.1663840Z 2025-05-07T20:24:55.1663845Z 2025-05-07T20:24:55.1663855Z 2025-05-07T20:24:55.1665762Z 2025-05-07T20:24:55.1704600Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:24:55.1705008Z 2025-05-07T20:24:55.1705014Z 2025-05-07T20:24:55.1705019Z 2025-05-07T20:24:55.1705024Z 2025-05-07T20:24:55.1705042Z 2025-05-07T20:24:55.1705047Z 2025-05-07T20:24:55.1705052Z 2025-05-07T20:24:55.1705057Z 2025-05-07T20:24:55.1705062Z 2025-05-07T20:24:55.1705067Z 2025-05-07T20:24:55.1705072Z 2025-05-07T20:24:55.1705086Z 2025-05-07T20:24:55.1705091Z 2025-05-07T20:24:55.1705096Z 2025-05-07T20:24:55.1705101Z 2025-05-07T20:24:55.1705106Z 2025-05-07T20:24:55.1705111Z 2025-05-07T20:24:55.1705116Z 2025-05-07T20:24:55.1705122Z 2025-05-07T20:24:55.1714016Z ... (more hidden) ... 
2025-05-07T20:24:55.1714396Z 2025-05-07T20:24:55.1714402Z 2025-05-07T20:24:55.1714407Z 2025-05-07T20:24:55.1714412Z 2025-05-07T20:24:55.1714645Z 2025-05-07T20:24:55.1714651Z 2025-05-07T20:24:55.1714656Z 2025-05-07T20:24:55.1714661Z 2025-05-07T20:24:55.1714666Z 2025-05-07T20:24:55.1714671Z 2025-05-07T20:24:55.1714676Z 2025-05-07T20:24:55.1714682Z 2025-05-07T20:24:55.1714687Z 2025-05-07T20:24:55.1714700Z 2025-05-07T20:24:55.1714706Z 2025-05-07T20:24:55.1714711Z 2025-05-07T20:24:55.1715635Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:55.1716025Z 2025-05-07T20:24:55.1716031Z 2025-05-07T20:24:55.1716231Z 2025-05-07T20:24:55.1716237Z 2025-05-07T20:24:55.1716242Z 2025-05-07T20:24:55.1716247Z 2025-05-07T20:24:55.1716252Z 2025-05-07T20:24:55.1716257Z 2025-05-07T20:24:55.1716262Z 2025-05-07T20:24:55.1716267Z 2025-05-07T20:24:55.1716273Z 2025-05-07T20:24:55.1716284Z 2025-05-07T20:24:55.1716289Z 2025-05-07T20:24:55.1716295Z 2025-05-07T20:24:55.1716300Z 2025-05-07T20:24:55.1716305Z 2025-05-07T20:24:55.1992031Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:24:55.1992618Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:55.8902408Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:55.8908626Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:24:55.8908969Z 2025-05-07T20:24:55.8908975Z 2025-05-07T20:24:55.8908980Z 2025-05-07T20:24:55.8908986Z 2025-05-07T20:24:55.8908992Z 2025-05-07T20:24:55.8908999Z 2025-05-07T20:24:55.8909005Z 2025-05-07T20:24:55.8909026Z 2025-05-07T20:24:55.8909031Z 2025-05-07T20:24:55.8909037Z 2025-05-07T20:24:55.8909043Z 2025-05-07T20:24:55.8909058Z 2025-05-07T20:24:55.8909063Z 2025-05-07T20:24:55.8909068Z 2025-05-07T20:24:55.8909072Z 2025-05-07T20:24:55.8909077Z 2025-05-07T20:24:55.8909082Z 2025-05-07T20:24:55.8909087Z 2025-05-07T20:24:55.8909092Z 2025-05-07T20:24:55.8909212Z 2025-05-07T20:24:55.8909693Z  2025-05-07T20:24:55.8910141Z 2025-05-07T20:24:55.8910407Z 2025-05-07T20:24:55.8910641Z  2025-05-07T20:24:55.8910919Z 2025-05-07T20:24:55.8910925Z 2025-05-07T20:24:55.8911170Z  2025-05-07T20:24:55.8911453Z 2025-05-07T20:24:55.8911459Z 2025-05-07T20:24:55.8911465Z 2025-05-07T20:24:55.8911696Z  2025-05-07T20:24:55.8911996Z 2025-05-07T20:24:55.8912002Z 2025-05-07T20:24:55.8912007Z 2025-05-07T20:24:55.8912012Z 2025-05-07T20:24:55.8912248Z  2025-05-07T20:24:55.8912539Z 2025-05-07T20:24:55.8912544Z 2025-05-07T20:24:55.8912550Z 2025-05-07T20:24:55.8912555Z 2025-05-07T20:24:55.8912560Z 2025-05-07T20:24:55.8912793Z  2025-05-07T20:24:55.8913084Z 2025-05-07T20:24:55.8913097Z 2025-05-07T20:24:55.8913102Z 2025-05-07T20:24:55.8913107Z 2025-05-07T20:24:55.8913112Z 2025-05-07T20:24:55.8913117Z 2025-05-07T20:24:55.8913358Z  2025-05-07T20:24:55.8913656Z 2025-05-07T20:24:55.8913661Z 2025-05-07T20:24:55.8913666Z 2025-05-07T20:24:55.8913672Z 2025-05-07T20:24:55.8913677Z 2025-05-07T20:24:55.8913682Z 2025-05-07T20:24:55.8913687Z 2025-05-07T20:24:55.8913958Z  2025-05-07T20:24:55.8914270Z 2025-05-07T20:24:55.8914275Z 2025-05-07T20:24:55.8914280Z 2025-05-07T20:24:55.8914285Z 2025-05-07T20:24:55.8914291Z 2025-05-07T20:24:55.8914296Z 2025-05-07T20:24:55.8914301Z 2025-05-07T20:24:55.8914306Z 2025-05-07T20:24:55.8914567Z  2025-05-07T20:24:55.8914861Z 2025-05-07T20:24:55.8914866Z 2025-05-07T20:24:55.8914871Z 2025-05-07T20:24:55.8914876Z 2025-05-07T20:24:55.8915105Z 2025-05-07T20:24:55.8915112Z 2025-05-07T20:24:55.8915116Z 2025-05-07T20:24:55.8915121Z 2025-05-07T20:24:55.8915126Z 2025-05-07T20:24:55.8915391Z  2025-05-07T20:24:55.8915615Z 
2025-05-07T20:24:55.8915618Z 2025-05-07T20:24:55.8915622Z 2025-05-07T20:24:55.8915625Z 2025-05-07T20:24:55.8915629Z 2025-05-07T20:24:55.8915632Z 2025-05-07T20:24:55.8915636Z 2025-05-07T20:24:55.8915639Z 2025-05-07T20:24:55.8915643Z 2025-05-07T20:24:55.8915775Z 2025-05-07T20:24:55.8915978Z  2025-05-07T20:24:55.8916205Z 2025-05-07T20:24:55.8916209Z 2025-05-07T20:24:55.8916212Z 2025-05-07T20:24:55.8916216Z 2025-05-07T20:24:55.8916219Z 2025-05-07T20:24:55.8916223Z 2025-05-07T20:24:55.8916226Z 2025-05-07T20:24:55.8916230Z 2025-05-07T20:24:55.8916239Z 2025-05-07T20:24:55.8916243Z 2025-05-07T20:24:55.8916246Z 2025-05-07T20:24:55.8916444Z  2025-05-07T20:24:55.8916663Z 2025-05-07T20:24:55.8916667Z 2025-05-07T20:24:55.8916671Z 2025-05-07T20:24:55.8916674Z 2025-05-07T20:24:55.8916688Z 2025-05-07T20:24:55.8916692Z 2025-05-07T20:24:55.8916696Z 2025-05-07T20:24:55.8916699Z 2025-05-07T20:24:55.8916703Z 2025-05-07T20:24:55.8916706Z 2025-05-07T20:24:55.8916710Z 2025-05-07T20:24:55.8916713Z 2025-05-07T20:24:55.8916907Z  2025-05-07T20:24:55.8917142Z 2025-05-07T20:24:55.8917145Z 2025-05-07T20:24:55.8917149Z 2025-05-07T20:24:55.8917152Z 2025-05-07T20:24:55.8917156Z 2025-05-07T20:24:55.8917160Z 2025-05-07T20:24:55.8917163Z 2025-05-07T20:24:55.8917167Z 2025-05-07T20:24:55.8917170Z 2025-05-07T20:24:55.8917174Z 2025-05-07T20:24:55.8917177Z 2025-05-07T20:24:55.8917181Z 2025-05-07T20:24:55.8917184Z 2025-05-07T20:24:55.8917387Z  2025-05-07T20:24:55.8917617Z 2025-05-07T20:24:55.8917621Z 2025-05-07T20:24:55.8917624Z 2025-05-07T20:24:55.8917628Z 2025-05-07T20:24:55.8917631Z 2025-05-07T20:24:55.8917635Z 2025-05-07T20:24:55.8917639Z 2025-05-07T20:24:55.8917642Z 2025-05-07T20:24:55.8917646Z 2025-05-07T20:24:55.8917649Z 2025-05-07T20:24:55.8917653Z 2025-05-07T20:24:55.8917657Z 2025-05-07T20:24:55.8917660Z 2025-05-07T20:24:55.8917664Z 2025-05-07T20:24:55.8917874Z  2025-05-07T20:24:55.8918104Z 2025-05-07T20:24:55.8918108Z 2025-05-07T20:24:55.8918112Z 2025-05-07T20:24:55.8918115Z 2025-05-07T20:24:55.8918119Z 2025-05-07T20:24:55.8918123Z 2025-05-07T20:24:55.8918126Z 2025-05-07T20:24:55.8918130Z 2025-05-07T20:24:55.8918139Z 2025-05-07T20:24:55.8918143Z 2025-05-07T20:24:55.8918146Z 2025-05-07T20:24:55.8918150Z 2025-05-07T20:24:55.8918154Z 2025-05-07T20:24:55.8918157Z 2025-05-07T20:24:55.8918166Z 2025-05-07T20:24:55.8918452Z  2025-05-07T20:24:55.8918738Z 2025-05-07T20:24:55.8918743Z 2025-05-07T20:24:55.8918747Z 2025-05-07T20:24:55.8918752Z 2025-05-07T20:24:55.8918756Z 2025-05-07T20:24:55.8918761Z 2025-05-07T20:24:55.8918765Z 2025-05-07T20:24:55.8918770Z 2025-05-07T20:24:55.8918774Z 2025-05-07T20:24:55.8918779Z 2025-05-07T20:24:55.8918783Z 2025-05-07T20:24:55.8918796Z 2025-05-07T20:24:55.8918806Z 2025-05-07T20:24:55.8918810Z 2025-05-07T20:24:55.8918815Z 2025-05-07T20:24:55.8918819Z 2025-05-07T20:24:55.8919083Z  2025-05-07T20:24:55.8919376Z 2025-05-07T20:24:55.8919380Z 2025-05-07T20:24:55.8919392Z 2025-05-07T20:24:55.8919396Z 2025-05-07T20:24:55.8919401Z 2025-05-07T20:24:55.8919405Z 2025-05-07T20:24:55.8919410Z 2025-05-07T20:24:55.8919414Z 2025-05-07T20:24:55.8919533Z 2025-05-07T20:24:55.8919539Z 2025-05-07T20:24:55.8919543Z 2025-05-07T20:24:55.8919548Z 2025-05-07T20:24:55.8919552Z 2025-05-07T20:24:55.8919557Z 2025-05-07T20:24:55.8919561Z 2025-05-07T20:24:55.8919566Z 2025-05-07T20:24:55.8919570Z 2025-05-07T20:24:55.8919845Z  2025-05-07T20:24:55.8920145Z 2025-05-07T20:24:55.8920150Z 2025-05-07T20:24:55.8920154Z 2025-05-07T20:24:55.8920159Z 2025-05-07T20:24:55.8920255Z 2025-05-07T20:24:55.8920260Z 2025-05-07T20:24:55.8920264Z 
2025-05-07T20:24:55.8921015Z done
2025-05-07T20:24:55.9920019Z Preparing transaction: done
2025-05-07T20:24:56.7516922Z Verifying transaction: done
2025-05-07T20:24:58.3548861Z Executing transaction: done
2025-05-07T20:24:58.7080937Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:25:00.4524721Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:00.4537208Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:00.4560602Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:01.3202449Z Channels:
2025-05-07T20:25:01.3202694Z  - conda-forge
2025-05-07T20:25:01.3202938Z Platform: linux-64
2025-05-07T20:25:04.7570057Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:05.1247920Z Solving environment: done
2025-05-07T20:25:05.1614139Z # All requested packages already installed.
2025-05-07T20:25:08.5298721Z [SETUP] Copying over ...
2025-05-07T20:25:08.5300067Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h
2025-05-07T20:25:10.1700784Z [SETUP] Installed Python version: Python 3.12.2
2025-05-07T20:25:10.1701239Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:25:10.1734615Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:10.1735080Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:10.1748807Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:10.1749158Z env:
2025-05-07T20:25:10.1749388Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:10.1749682Z   BUILD_ENV: build_binary
2025-05-07T20:25:10.1749933Z   BUILD_TARGET: genai
2025-05-07T20:25:10.1750162Z   BUILD_VARIANT: cuda
2025-05-07T20:25:10.1750395Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:10.1750652Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:10.1750956Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:10.1751280Z ##[endgroup]
2025-05-07T20:25:10.5085903Z ################################################################################
2025-05-07T20:25:10.5086257Z # Install C/C++ Compilers
2025-05-07T20:25:10.5086504Z #
2025-05-07T20:25:10.5102769Z # [2025-05-07T20:25:10.509Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:10.5103174Z ################################################################################
2025-05-07T20:25:10.5120023Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:10.6161933Z [CHECK] Network does not appear to be blocked.
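The "[EXEC] [ATTEMPT 0/3]" prefixes above appear to come from a retry helper in .github/scripts/setup_env.bash. A minimal bash sketch of that pattern, assuming a hypothetical helper name run_with_retries (not the actual implementation):

    # Hypothetical retry wrapper; mirrors the "[EXEC] [ATTEMPT n/3]" log format.
    run_with_retries () {
      local max_retries=3 attempt=0
      while [ "$attempt" -le "$max_retries" ]; do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        # Run the command; stop retrying as soon as it succeeds.
        if "$@"; then
          return 0
        fi
        attempt=$((attempt + 1))
        sleep 2
      done
      echo "[EXEC] Command failed after ${max_retries} retries: $*" >&2
      return 1
    }

    # Usage mirroring the network check above:
    run_with_retries wget -q --timeout 1 pypi.org -O /dev/null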
2025-05-07T20:25:10.6172557Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:10.6195137Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:11.4878798Z Channels:
2025-05-07T20:25:11.4879422Z  - conda-forge
2025-05-07T20:25:11.4879977Z Platform: linux-64
2025-05-07T20:25:14.8256426Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:15.1964556Z Solving environment: done
2025-05-07T20:25:15.2605110Z ## Package Plan ##
2025-05-07T20:25:15.2605838Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:15.2606285Z   added / updated specs:
2025-05-07T20:25:15.2606552Z     - sysroot_linux-64=2.17
2025-05-07T20:25:15.2606859Z The following packages will be downloaded:
2025-05-07T20:25:15.2607196Z     package                        |            build
2025-05-07T20:25:15.2607514Z     -------------------------------|-----------------
2025-05-07T20:25:15.2607930Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:25:15.2608418Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:25:15.2608827Z     ------------------------------------------------------------
2025-05-07T20:25:15.2609164Z                                                Total:        15.4 MB
2025-05-07T20:25:15.2609503Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:15.2610020Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:15.2610576Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:15.2611031Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:15.5847843Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:16.3269428Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:16.3273916Z done
2025-05-07T20:25:16.4277356Z Preparing transaction: done
2025-05-07T20:25:16.6288437Z Verifying transaction: done
2025-05-07T20:25:16.8368070Z Executing transaction: done
2025-05-07T20:25:16.9922442Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:16.9922773Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:18.6663299Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
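The LD_LIBRARY_PATH / libstdc++ lines above are environment sanity checks. A bash sketch of an equivalent check, assuming the environment prefix shown in the log (not the setup_env.bash original):

    # Verify that the conda env ships its own libstdc++ and report how it resolves.
    env_prefix=/home/ec2-user/miniconda/envs/build_binary
    lib="${env_prefix}/lib/libstdc++.so.6"
    if [ -L "$lib" ]; then
      echo "[CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): $lib"
      readlink -f "$lib"   # the concrete versioned library the symlink points to
    else
      echo "[CHECK] libstdc++.so.6 missing or not a symlink under ${env_prefix}/lib" >&2
    fi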
2025-05-07T20:25:18.6675879Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:18.6697718Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:19.5603611Z Channels:
2025-05-07T20:25:19.5603854Z  - conda-forge
2025-05-07T20:25:19.5604088Z Platform: linux-64
2025-05-07T20:25:22.8347955Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:23.7957870Z Solving environment: done
2025-05-07T20:25:23.8616096Z ## Package Plan ##
2025-05-07T20:25:23.8616819Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:23.8617618Z   added / updated specs:
2025-05-07T20:25:23.8618178Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:23.8619268Z The following packages will be downloaded:
2025-05-07T20:25:23.8619763Z     package                         |            build
2025-05-07T20:25:23.8620111Z     --------------------------------|-----------------
2025-05-07T20:25:23.8620512Z     binutils_impl_linux-64-2.40     |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:25:23.8620998Z     binutils_linux-64-2.40          |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:25:23.8621461Z     gcc_impl_linux-64-11.4.0        |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:25:23.8621904Z     gcc_linux-64-11.4.0             |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:25:23.8622341Z     gxx_impl_linux-64-11.4.0        |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:25:23.8622780Z     gxx_linux-64-11.4.0             |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:25:23.8623215Z     ld_impl_linux-64-2.40           |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:25:23.8623695Z     libgcc-devel_linux-64-11.4.0    |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:25:23.8624165Z     libsanitizer-11.4.0             |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:25:23.8624607Z     libstdcxx-15.1.0                |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:25:23.8625082Z     libstdcxx-devel_linux-64-11.4.0 |     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:25:23.8625554Z     libstdcxx-ng-15.1.0             |       h4852527_2          34 KB  conda-forge
2025-05-07T20:25:23.8625967Z     ------------------------------------------------------------
2025-05-07T20:25:23.8626317Z                                                Total:        91.6 MB
2025-05-07T20:25:23.8626668Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:23.8627163Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:23.8627922Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:23.8628460Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:23.8628964Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:23.8629457Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:23.8629957Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:23.8630482Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:23.8631037Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:23.8631524Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:23.8632061Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:23.8632543Z The following packages will be UPDATED:
2025-05-07T20:25:23.8633069Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:23.8633771Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:23.8634327Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:24.3235661Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:24.6813017Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:24.8041325Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:24.8246632Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:24.8921944Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:24.8953246Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:24.9313237Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:24.9807161Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:25.1808972Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:25.3144745Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:26.0644185Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:26.0650803Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
2025-05-07T20:25:26.0656906Z done
2025-05-07T20:25:26.1658548Z Preparing transaction: done
2025-05-07T20:25:26.4673071Z Verifying transaction: done
2025-05-07T20:25:26.5683105Z Executing transaction: done
2025-05-07T20:25:26.7318417Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:30.6396238Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:30.6426521Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:30.6456616Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:30.6486035Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:32.5385251Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:32.6008218Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:34.4813660Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:34.5436393Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:36.4274110Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:36.4908194Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:38.3803030Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:38.4432130Z [CHECK] Binary g++ found in PATH
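The four ln -sf calls above point cc/gcc/c++/g++ at the conda cross-toolchain binaries. A quick way to confirm the links resolve as intended (an illustrative bash sketch; the paths are taken from the log, the loop itself is an assumption):

    bin=/home/ec2-user/miniconda/envs/build_binary/bin
    for tool in cc gcc c++ g++; do
      # Each alias should resolve to an x86_64-conda-linux-gnu-* binary.
      printf '%s -> %s\n' "${bin}/${tool}" "$(readlink -f "${bin}/${tool}")"
    done
    "${bin}/cc" --version | head -n 1   # should report gcc 11.4.0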
2025-05-07T20:25:38.4436266Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:38.4436686Z + conda run -n build_binary cc -dM -E -
#define __DBL_MIN_EXP__ (-1021)
#define __UINT_LEAST16_MAX__ 0xffff
#define __ATOMIC_ACQUIRE 2
#define __FLT128_MAX_10_EXP__ 4932
#define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
#define __GCC_IEC_559_COMPLEX 2
#define __UINT_LEAST8_TYPE__ unsigned char
#define __SIZEOF_FLOAT80__ 16
#define __INTMAX_C(c) c ## L
#define __CHAR_BIT__ 8
#define __UINT8_MAX__ 0xff
#define __SCHAR_WIDTH__ 8
#define __WINT_MAX__ 0xffffffffU
#define __FLT32_MIN_EXP__ (-125)
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __SIZE_MAX__ 0xffffffffffffffffUL
#define __WCHAR_MAX__ 0x7fffffff
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
#define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L)
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
#define __GCC_ATOMIC_CHAR_LOCK_FREE 2
#define __GCC_IEC_559 2
#define __FLT32X_DECIMAL_DIG__ 17
#define __FLT_EVAL_METHOD__ 0
#define __FLT64_DECIMAL_DIG__ 17
#define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
#define __UINT_FAST64_MAX__ 0xffffffffffffffffUL
#define __SIG_ATOMIC_TYPE__ int
#define __DBL_MIN_10_EXP__ (-307)
#define __FINITE_MATH_ONLY__ 0
#define __FLT32X_MAX_EXP__ 1024
#define __FLT32_HAS_DENORM__ 1
#define __UINT_FAST8_MAX__ 0xff
#define __FLT32_MAX_10_EXP__ 38
#define __DEC64_MAX_EXP__ 385
#define __INT8_C(c) c
#define __INT_LEAST8_WIDTH__ 8
#define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL
#define __SHRT_MAX__ 0x7fff
#define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L
#define __FLT64X_MAX_10_EXP__ 4932
#define __LDBL_IS_IEC_60559__ 2
#define __FLT64X_HAS_QUIET_NAN__ 1
#define __UINT_LEAST8_MAX__ 0xff
#define __GCC_ATOMIC_BOOL_LOCK_FREE 2
#define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128
#define __UINTMAX_TYPE__ long unsigned int
#define __linux 1
#define __DEC32_EPSILON__ 1E-6DF
#define __FLT_EVAL_METHOD_TS_18661_3__ 0
#define __unix 1
#define __UINT32_MAX__ 0xffffffffU
#define __FLT128_MIN_EXP__ (-16381)
#define __WINT_MIN__ 0U
#define __FLT128_MIN_10_EXP__ (-4931)
#define __FLT32X_IS_IEC_60559__ 2
#define __INT_LEAST16_WIDTH__ 16
#define __SCHAR_MAX__ 0x7f
#define __FLT128_MANT_DIG__ 113
#define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
#define __INT64_C(c) c ## L
#define __GCC_ATOMIC_POINTER_LOCK_FREE 2
#define __FLT32X_MANT_DIG__ 53
#define __USER_LABEL_PREFIX__
#define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x
#define __STDC_HOSTED__ 1
#define __DEC64_MIN_EXP__ (-382)
#define __DBL_DIG__ 15
#define __FLT32_DIG__ 6
#define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F
#define __SHRT_WIDTH__ 16
#define __FLT32_IS_IEC_60559__ 2
#define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L
#define __STDC_UTF_16__ 1
#define __DBL_IS_IEC_60559__ 2
#define __DEC32_MAX__ 9.999999E96DF
#define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x
#define __FLT32X_HAS_INFINITY__ 1
#define __INT32_MAX__ 0x7fffffff
#define __unix__ 1
#define __INT_WIDTH__ 32
#define __SIZEOF_LONG__ 8
#define __STDC_IEC_559__ 1
#define __STDC_ISO_10646__ 201103L
#define __UINT16_C(c) c
#define __DECIMAL_DIG__ 21
#define __STDC_IEC_559_COMPLEX__ 1
#define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64
#define __gnu_linux__ 1
#define __FLT128_IS_IEC_60559__ 2
#define __FLT64X_MIN_10_EXP__ (-4931)
#define __LDBL_HAS_QUIET_NAN__ 1
#define __FLT64_MANT_DIG__ 53
#define __FLT64X_MANT_DIG__ 64
#define __GNUC__ 11
#define __pie__ 2
#define __MMX__ 1
#define __FLT_HAS_DENORM__ 1
#define __SIZEOF_LONG_DOUBLE__ 16
#define __BIGGEST_ALIGNMENT__ 16
#define __FLT64_MAX_10_EXP__ 308
#define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L)
#define __INT_FAST32_MAX__ 0x7fffffffffffffffL
#define __DBL_HAS_INFINITY__ 1
#define __SIZEOF_FLOAT__ 4
#define __HAVE_SPECULATION_SAFE_VALUE 1
#define __DEC32_MIN_EXP__ (-94)
#define __INTPTR_WIDTH__ 64
#define __FLT64X_HAS_INFINITY__ 1
#define __UINT_LEAST32_MAX__ 0xffffffffU
#define __FLT32X_HAS_DENORM__ 1
#define __INT_FAST16_TYPE__ long int
#define __MMX_WITH_SSE__ 1
#define __LDBL_HAS_DENORM__ 1
#define __FLT128_HAS_INFINITY__ 1
#define __DEC32_MIN__ 1E-95DF
#define __DBL_MAX_EXP__ 1024
#define __WCHAR_WIDTH__ 32
#define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32
#define __DEC128_EPSILON__ 1E-33DL
#define __SSE2_MATH__ 1
#define __ATOMIC_HLE_RELEASE 131072
#define __PTRDIFF_MAX__ 0x7fffffffffffffffL
#define __amd64 1
#define __STDC_NO_THREADS__ 1
#define __ATOMIC_HLE_ACQUIRE 65536
#define __LONG_LONG_MAX__ 0x7fffffffffffffffLL
#define __SIZEOF_SIZE_T__ 8
#define __FLT64X_MIN_EXP__ (-16381)
#define __SIZEOF_WINT_T__ 4
#define __LONG_LONG_WIDTH__ 64
#define __FLT32_MAX_EXP__ 128
#define __GXX_ABI_VERSION 1016
#define __FLT_MIN_EXP__ (-125)
#define __GCC_HAVE_DWARF2_CFI_ASM 1
#define __INT16_MAX__ 0x7fff
#define __x86_64 1
#define __INT_FAST64_TYPE__ long int
#define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64
#define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L)
#define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128
#define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIZEOF_POINTER__ 8
#define __LP64__ 1
#define __DBL_HAS_QUIET_NAN__ 1
#define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x
#define __DECIMAL_BID_FORMAT__ 1
#define __FLT64_MIN_EXP__ (-1021)
#define __FLT64_MIN_10_EXP__ (-307)
#define __FLT64X_DECIMAL_DIG__ 21
#define __DEC128_MIN__ 1E-6143DL
#define __REGISTER_PREFIX__
#define __UINT16_MAX__ 0xffff
#define __DBL_HAS_DENORM__ 1
#define __LDBL_HAS_INFINITY__ 1
#define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32
#define __UINT8_TYPE__ unsigned char
#define __FLT_DIG__ 6
#define __NO_INLINE__ 1
#define __DEC_EVAL_METHOD__ 2
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL
#define __FLT_MANT_DIG__ 24
#define __LDBL_DECIMAL_DIG__ 21
#define __VERSION__ "11.4.0"
#define __UINT64_C(c) c ## UL
#define _STDC_PREDEF_H 1
#define __INT_LEAST32_MAX__ 0x7fffffff
#define __GCC_ATOMIC_INT_LOCK_FREE 2
#define __FLT128_MAX_EXP__ 16384
#define __FLT32_MANT_DIG__ 24
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __FLT128_HAS_DENORM__ 1
#define __FLT32_DECIMAL_DIG__ 9
#define __FLT128_DIG__ 33
#define __INT32_C(c) c
#define __DEC64_EPSILON__ 1E-15DD
#define __ORDER_PDP_ENDIAN__ 3412
#define __DEC128_MIN_EXP__ (-6142)
#define __INT_FAST32_TYPE__ long int
#define __UINT_LEAST16_TYPE__ short unsigned int
#define unix 1
#define __SIZE_TYPE__ long unsigned int
#define __UINT64_MAX__ 0xffffffffffffffffUL
#define __FLT_IS_IEC_60559__ 2
#define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE"
#define __FLT64X_DIG__ 18
#define __INT8_TYPE__ signed char
#define __ELF__ 1
#define __GCC_ASM_FLAG_OUTPUTS__ 1
#define __UINT32_TYPE__ unsigned int
#define __FLT_RADIX__ 2
#define __INT_LEAST16_TYPE__ short int
#define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L
#define __UINTMAX_C(c) c ## UL
#define __SSE_MATH__ 1
#define __k8 1
#define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x
#define __SIG_ATOMIC_MAX__ 0x7fffffff
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
#define __SIZEOF_PTRDIFF_T__ 8
#define __LDBL_DIG__ 18
#define __FLT64_IS_IEC_60559__ 2
#define __x86_64__ 1
#define __FLT32X_MIN_EXP__ (-1021)
#define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF
#define __INT_FAST16_MAX__ 0x7fffffffffffffffL
#define __FLT64_DIG__ 15
#define __UINT_FAST32_MAX__ 0xffffffffffffffffUL
#define __UINT_LEAST64_TYPE__ long unsigned int
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MAX_10_EXP__ 38
#define __LONG_MAX__ 0x7fffffffffffffffL
#define __FLT64X_HAS_DENORM__ 1
#define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL
#define __FLT_HAS_INFINITY__ 1
#define __GNUC_EXECUTION_CHARSET_NAME "UTF-8"
#define __UINT_FAST16_TYPE__ long unsigned int
#define __DEC64_MAX__ 9.999999999999999E384DD
#define __INT_FAST32_WIDTH__ 64
#define __CHAR16_TYPE__ short unsigned int
#define __PRAGMA_REDEFINE_EXTNAME 1
#define __SIZE_WIDTH__ 64
#define __SEG_FS 1
#define __INT_LEAST16_MAX__ 0x7fff
#define __DEC64_MANT_DIG__ 16
#define __INT64_MAX__ 0x7fffffffffffffffL
#define __SEG_GS 1
#define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32
#define __SIG_ATOMIC_WIDTH__ 32
#define __INT_LEAST64_TYPE__ long int
#define __INT16_TYPE__ short int
#define __INT_LEAST8_TYPE__ signed char
#define __STDC_VERSION__ 201710L
#define __SIZEOF_INT__ 4
#define __DEC32_MAX_EXP__ 97
#define __INT_FAST8_MAX__ 0x7f
#define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128
#define __INTPTR_MAX__ 0x7fffffffffffffffL
#define linux 1
#define __FLT64_HAS_QUIET_NAN__ 1
#define __FLT32_MIN_10_EXP__ (-37)
#define __FLT32X_DIG__ 15
#define __PTRDIFF_WIDTH__ 64
#define __LDBL_MANT_DIG__ 64
#define __FLT64_HAS_INFINITY__ 1
#define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1)
#define __code_model_small__ 1
#define __GCC_ATOMIC_LONG_LOCK_FREE 2
#define __DEC32_MANT_DIG__ 7
#define __k8__ 1
#define __INTPTR_TYPE__ long int
#define __UINT16_TYPE__ short unsigned int
#define __WCHAR_TYPE__ int
#define __pic__ 2
#define __UINTPTR_MAX__ 0xffffffffffffffffUL
#define __INT_FAST64_WIDTH__ 64
#define __INT_FAST64_MAX__ 0x7fffffffffffffffL
#define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
#define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F
#define __FLT32_HAS_INFINITY__ 1
#define __FLT64X_MAX_EXP__ 16384
#define __UINT_FAST64_TYPE__ long unsigned int
#define __INT_MAX__ 0x7fffffff
#define __linux__ 1
#define __INT64_TYPE__ long int
#define __FLT_MAX_EXP__ 128
#define __ORDER_BIG_ENDIAN__ 4321
#define __DBL_MANT_DIG__ 53
#define __SIZEOF_FLOAT128__ 16
#define __INT_LEAST64_MAX__ 0x7fffffffffffffffL
#define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2
#define __DEC64_MIN__ 1E-383DD
#define __WINT_TYPE__ unsigned int
#define __UINT_LEAST32_TYPE__ unsigned int
#define __SIZEOF_SHORT__ 2
#define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32
#define __SSE__ 1
#define __LDBL_MIN_EXP__ (-16381)
#define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64
#define __amd64__ 1
#define __WINT_WIDTH__ 32
#define __INT_LEAST8_MAX__ 0x7f
#define __INT_LEAST64_WIDTH__ 64
#define __LDBL_MAX_EXP__ 16384
#define __FLT32X_MAX_10_EXP__ 308
#define __SIZEOF_INT128__ 16
#define __FLT64X_IS_IEC_60559__ 2
#define __LDBL_MAX_10_EXP__ 4932
#define __ATOMIC_RELAXED 0
#define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L)
#define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128
#define _LP64 1
#define __UINT8_C(c) c
#define __FLT64_MAX_EXP__ 1024
#define __INT_LEAST32_TYPE__ int
#define __SIZEOF_WCHAR_T__ 4
#define __UINT64_TYPE__ long unsigned int
#define __GNUC_PATCHLEVEL__ 0
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128
#define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64
#define __FLT128_HAS_QUIET_NAN__ 1
#define __INTMAX_MAX__ 0x7fffffffffffffffL
#define __INT_FAST8_TYPE__ signed char
#define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x
#define __GNUC_STDC_INLINE__ 1
#define __FLT64_HAS_DENORM__ 1
#define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32
#define __DBL_DECIMAL_DIG__ 17
#define __STDC_UTF_32__ 1
#define __INT_FAST8_WIDTH__ 8
#define __FXSR__ 1
#define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x
#define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L)
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __INTMAX_WIDTH__ 64
#define __UINT32_C(c) c ## U
#define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F
#define __INT8_MAX__ 0x7f
#define __LONG_WIDTH__
64 2025-05-07T20:25:40.3464782Z #define __PIC__ 2 2025-05-07T20:25:40.3465033Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:40.3465423Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:40.3465797Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:40.3466127Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:40.3466451Z #define __SSE2__ 1 2025-05-07T20:25:40.3466671Z #define __INT32_TYPE__ int 2025-05-07T20:25:40.3466914Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:40.3467169Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:40.3467500Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:40.3467847Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:40.3468119Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:40.3468399Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:40.3468660Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:40.3468929Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:40.3469176Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:40.3469416Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:40.3469703Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:40.3469997Z #define __PIE__ 2 2025-05-07T20:25:40.3470310Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:40.3470698Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:40.3471043Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:40.3471402Z #define __INT16_C(c) c 2025-05-07T20:25:40.3471621Z #define __STDC__ 1 2025-05-07T20:25:40.3471852Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:40.3472122Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:40.3472374Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:40.3472678Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:40.3473224Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:40.3473543Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:40.3473807Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:40.3474081Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:40.3474337Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:40.3474619Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:40.3474905Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:40.3475167Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:40.3475463Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:40.3475854Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:40.3476272Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:40.3476565Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:40.3476854Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:40.3477104Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:40.3477263Z 2025-05-07T20:25:40.3986707Z 2025-05-07T20:25:40.3987492Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
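The dumps above and below come from GCC's -dM -E mode, which stops after preprocessing and prints every macro the compiler predefines instead of compiling anything. A minimal sketch of the same inspection outside the CI prelude, assuming only the build_binary env used throughout this job:

# Dump every predefined macro for an empty C translation unit.
# -dM prints #define directives only; -E stops after preprocessing;
# reading from /dev/null supplies the empty input.
conda run -n build_binary cc -dM -E - < /dev/null | sort

# Filter for the macros that identify the compiler, target, and standard.
conda run -n build_binary cc -dM -E - < /dev/null \
  | grep -E '__(VERSION|STDC_VERSION|x86_64|linux)__'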
2025-05-07T20:25:40.3987965Z + conda run -n build_binary c++ -dM -E -x c++ -
2025-05-07T20:25:40.3988195Z 
2025-05-07T20:25:42.2963065Z [... full C++ preprocessor define dump omitted: it mirrors the C dump (same GCC 11.4.0, x86-64 Linux target) and adds the C++-specific macros, notably __cplusplus 201703L, __GNUG__ 11, __EXCEPTIONS 1, __GXX_RTTI 1, __STDCPP_THREADS__ 1, and C++17 feature-test macros such as __cpp_if_constexpr 201606L, __cpp_structured_bindings 201606L, and __cpp_deduction_guides 201703L ...]
2025-05-07T20:25:42.3115338Z 
2025-05-07T20:25:42.3609385Z 
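The dump confirms that this toolchain defaults to C++17 (__cplusplus 201703L, with the matching __cpp_* feature-test macros). The same -dM -E technique can verify what an explicit -std flag would change before wiring it into a build; a sketch, assuming the same env:

# Compare __cplusplus under explicit -std flags.
# GCC 11.4 also accepts -std=c++20, and the reported value rises accordingly.
for std in c++17 c++20; do
  printf '%-6s -> ' "$std"
  conda run -n build_binary c++ -dM -E -x c++ "-std=$std" - < /dev/null \
    | grep __cplusplus
done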
2025-05-07T20:25:42.3610196Z + conda run -n build_binary c++ --version
2025-05-07T20:25:42.3610680Z 
2025-05-07T20:25:44.2435513Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:44.2436275Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:44.2437204Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:44.2437836Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:44.2438155Z 
2025-05-07T20:25:44.2438159Z 
2025-05-07T20:25:44.3051134Z 
2025-05-07T20:25:44.3051894Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:44.3052461Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:44.3052762Z 
2025-05-07T20:25:46.2569559Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:46.2572023Z 
2025-05-07T20:25:46.2572550Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:46.2573178Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:46.2573510Z 
2025-05-07T20:25:48.2096509Z #define __cplusplus 201703L
2025-05-07T20:25:48.2098575Z 
2025-05-07T20:25:48.2099210Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:48.2133741Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:48.2134166Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3
2025-05-07T20:25:48.2146515Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:48.2146856Z env:
2025-05-07T20:25:48.2147081Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:48.2147383Z   BUILD_ENV: build_binary
2025-05-07T20:25:48.2147620Z   BUILD_TARGET: genai
2025-05-07T20:25:48.2147846Z   BUILD_VARIANT: cuda
2025-05-07T20:25:48.2148079Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:48.2148336Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:48.2148629Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:48.2148952Z ##[endgroup]
2025-05-07T20:25:48.5507432Z ################################################################################
2025-05-07T20:25:48.5507779Z # Install CUDA
2025-05-07T20:25:48.5508001Z #
2025-05-07T20:25:48.5524454Z # [2025-05-07T20:25:48.552Z] + install_cuda build_binary 12.6.3
2025-05-07T20:25:48.5524853Z ################################################################################
2025-05-07T20:25:48.5525077Z 
2025-05-07T20:25:48.5541092Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:48.6437520Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:48.6438562Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:48.6444150Z + conda clean --packages --tarball -y
2025-05-07T20:25:48.6444367Z 
2025-05-07T20:25:49.5107378Z Will remove 40 (182.7 MB) tarball(s).
2025-05-07T20:25:49.5107838Z Will remove 7 (108.6 MB) package(s).
2025-05-07T20:25:49.5760394Z 
2025-05-07T20:25:49.5769204Z + conda clean --all -y
2025-05-07T20:25:49.5769438Z 
2025-05-07T20:25:50.2689632Z There are no unused tarball(s) to remove.
2025-05-07T20:25:50.2690338Z Will remove 1 index cache(s).
2025-05-07T20:25:50.2690976Z There are no unused package(s) to remove.
2025-05-07T20:25:50.2691652Z There are no tempfile(s) to remove.
2025-05-07T20:25:50.2692263Z There are no logfile(s) to remove.
2025-05-07T20:25:50.3332937Z 
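The [EXEC] [ATTEMPT 0/3] prefix indicates that setup_env.bash wraps network-dependent commands in a bounded retry loop. The helper itself is not shown in this log, so the following is a minimal sketch of the pattern only; the function name, the 3-attempt limit, and the backoff are illustrative assumptions:

# Hypothetical retry wrapper in the spirit of the [EXEC] [ATTEMPT n/3] lines.
exec_with_retries () {
  local max_attempts=3 attempt
  for ((attempt = 0; attempt < max_attempts; attempt++)); do
    echo "[EXEC] [ATTEMPT ${attempt}/${max_attempts}] + $*"
    if "$@"; then
      return 0
    fi
    sleep $((2 ** attempt))   # 1s, 2s, 4s between attempts
  done
  echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
  return 1
}

# Example: the network probe from this step.
exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null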
2025-05-07T20:25:50.3346579Z [INSTALL] Installing CUDA 12.6.3 ...
2025-05-07T20:25:50.3370123Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3
2025-05-07T20:25:51.2503807Z Channels:
2025-05-07T20:25:51.2504068Z  - conda-forge
2025-05-07T20:25:51.2504310Z Platform: linux-64
2025-05-07T20:26:01.7726194Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:02.8630439Z Solving environment: done
2025-05-07T20:26:02.9387007Z 
2025-05-07T20:26:02.9387524Z ## Package Plan ##
2025-05-07T20:26:02.9387710Z 
2025-05-07T20:26:02.9387931Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:02.9388224Z 
2025-05-07T20:26:02.9388322Z   added / updated specs:
2025-05-07T20:26:02.9388574Z     - cuda=12.6.3
2025-05-07T20:26:02.9388707Z 
2025-05-07T20:26:02.9388877Z The following packages will be downloaded:
2025-05-07T20:26:02.9389089Z 
2025-05-07T20:26:02.9389204Z     package                    |            build
2025-05-07T20:26:02.9389539Z     ---------------------------|-----------------
2025-05-07T20:26:02.9390054Z     [... the largest items are shown below; the remaining entries (cuda-* dev/tools components plus X11, font, and system-library dependencies, all from conda-forge) are omitted ...]
2025-05-07T20:26:02.9392160Z     cuda-12.6.3                |       ha804496_0          26 KB  conda-forge
2025-05-07T20:26:02.9402359Z     cuda-nsight-12.6.77        |       h7938cbb_0       113.2 MB  conda-forge
2025-05-07T20:26:02.9402964Z     cuda-nvcc-12.6.85          |       hcdd1206_0          23 KB  conda-forge
2025-05-07T20:26:02.9405287Z     cuda-nvdisasm-12.6.77      |       hbd13f7d_1        47.6 MB  conda-forge
2025-05-07T20:26:02.9407057Z     cuda-nvrtc-12.6.85         |       hbd13f7d_0        17.3 MB  conda-forge
2025-05-07T20:26:02.9409743Z     cuda-nvvp-12.6.80          |       hbd13f7d_1       109.3 MB  conda-forge
2025-05-07T20:26:02.9412601Z     cuda-toolkit-12.6.3        |       ha804496_0          19 KB  conda-forge
2025-05-07T20:26:02.9421826Z     libcublas-12.6.4.1         |       h5888daf_1       256.2 MB  conda-forge
2025-05-07T20:26:02.9422694Z     libcufft-11.3.0.4          |       hbd13f7d_0       156.2 MB  conda-forge
2025-05-07T20:26:02.9424452Z     libcurand-10.3.7.77        |       hbd13f7d_0        39.9 MB  conda-forge
2025-05-07T20:26:02.9425348Z     libcusolver-11.7.1.2       |       h5888daf_1        95.8 MB  conda-forge
2025-05-07T20:26:02.9426249Z     libcusparse-12.5.4.2       |       hbd13f7d_0       118.6 MB  conda-forge
2025-05-07T20:26:02.9430673Z     libnpp-12.3.1.54           |       h5888daf_0        93.4 MB  conda-forge
2025-05-07T20:26:02.9438924Z     nsight-compute-2024.3.2.3  |       hb5ebaad_0       443.1 MB  conda-forge
2025-05-07T20:26:02.9453156Z     ------------------------------------------------------------
2025-05-07T20:26:02.9453611Z                                            Total:        1.61 GB
2025-05-07T20:26:02.9453821Z 
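Once this plan is applied, the install can be sanity-checked from inside the same environment. A sketch using standard conda and CUDA toolkit commands (nvcc arrives via the cuda-nvcc-tools package listed above):

# Confirm the toolkit is visible inside the build_binary env.
conda run -n build_binary nvcc --version   # reports "release 12.6" for this pin
conda run -n build_binary which nvcc       # should resolve under the env prefix

# Cross-check the conda-side record of what was installed.
conda list -n build_binary | grep -E '^cuda(-toolkit)? '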
2025-05-07T20:26:02.9463427Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9463944Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:26:02.9464433Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:26:02.9465004Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9465528Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:26:02.9466085Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:02.9466598Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:26:02.9467082Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:26:02.9467633Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:26:02.9468163Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:26:02.9468632Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:26:02.9469147Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:26:02.9469696Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:26:02.9470227Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:26:02.9470767Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:26:02.9471300Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9471809Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9472295Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:26:02.9481563Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9482151Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:26:02.9482846Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:26:02.9483611Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:26:02.9484135Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:02.9484683Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:26:02.9485217Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:26:02.9485716Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:26:02.9486193Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:26:02.9486700Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:26:02.9487266Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:26:02.9487804Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:26:02.9488342Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:26:02.9488882Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:26:02.9489351Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:26:02.9489821Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:26:02.9490331Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:26:02.9490868Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 
2025-05-07T20:26:02.9491313Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:26:02.9491813Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:26:02.9492396Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:26:02.9493154Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:26:02.9493722Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:26:02.9494209Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:26:02.9494722Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:26:02.9495225Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:26:02.9495675Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:26:02.9496085Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:26:02.9496499Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4
2025-05-07T20:26:02.9496913Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:26:02.9497280Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:26:02.9497686Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:26:02.9498097Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:26:02.9498484Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:26:02.9498918Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1
2025-05-07T20:26:02.9499412Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1
2025-05-07T20:26:02.9499899Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0
2025-05-07T20:26:02.9500362Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0
2025-05-07T20:26:02.9500842Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4
2025-05-07T20:26:02.9501327Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4
2025-05-07T20:26:02.9501808Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0
2025-05-07T20:26:02.9502295Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0
2025-05-07T20:26:02.9502884Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1
2025-05-07T20:26:02.9503404Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1
2025-05-07T20:26:02.9503922Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0
2025-05-07T20:26:02.9504437Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0
2025-05-07T20:26:02.9504939Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:26:02.9505403Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:26:02.9505895Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:26:02.9506394Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:26:02.9506867Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:26:02.9507335Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:26:02.9507802Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:26:02.9508218Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:26:02.9508639Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0
2025-05-07T20:26:02.9509096Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0
2025-05-07T20:26:02.9509544Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:26:02.9510004Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0
2025-05-07T20:26:02.9510527Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0
2025-05-07T20:26:02.9511056Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0
2025-05-07T20:26:02.9511583Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0
2025-05-07T20:26:02.9512198Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0
2025-05-07T20:26:02.9512702Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0
2025-05-07T20:26:02.9513177Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:26:02.9513617Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:26:02.9514074Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:26:02.9514498Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:26:02.9514948Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:26:02.9515418Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:26:02.9515862Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:26:02.9516265Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:26:02.9516735Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0
2025-05-07T20:26:02.9517214Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:26:02.9517583Z nss conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:26:02.9517970Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:26:02.9518444Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:26:02.9518923Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:26:02.9519380Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:26:02.9519857Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:26:02.9520280Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:26:02.9520699Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:26:02.9521276Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:26:02.9521954Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:26:02.9522476Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:26:02.9523035Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:26:02.9523550Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:26:02.9524053Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:26:02.9524557Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:26:02.9525019Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:26:02.9525481Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:26:02.9525947Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:26:02.9526481Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:26:02.9527045Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:26:02.9527562Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:26:02.9528048Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:26:02.9528549Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:26:02.9529041Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:26:02.9529526Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:26:02.9530052Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:26:02.9530570Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:26:02.9531007Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:26:02.9531246Z
2025-05-07T20:26:02.9531507Z The following packages will be UPDATED:
2025-05-07T20:26:02.9531715Z
2025-05-07T20:26:02.9531877Z libsqlite 3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
2025-05-07T20:26:02.9532283Z libzlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:02.9532657Z zlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:02.9532890Z
2025-05-07T20:26:02.9533225Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:26:02.9533534Z
2025-05-07T20:26:02.9533788Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:26:02.9534362Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:26:02.9534680Z
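[Editor's note] The SUPERSEDED entries above are version downgrades, not updates: with channel priority in effect, conda prefers any build from the higher-priority channel (here conda-forge) over a newer version from a lower-priority channel (pkgs/main), which is why sqlite moves from 3.45.3 to 3.32.3 and tk from 8.6.14 to 8.6.13. A minimal sketch of how such a preference is typically configured follows; this is an illustrative assumption, since the runner's actual conda configuration is not shown in this log:

    # Hypothetical configuration sketch -- not taken from this log.
    # Make conda-forge the highest-priority channel and enforce strict
    # priority, so conda-forge builds supersede pkgs/main builds even
    # when the pkgs/main version number is higher.
    conda config --add channels conda-forge
    conda config --set channel_priority strict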
2025-05-07T20:26:02.9534845Z Downloading and Extracting Packages: ...working...
... (interleaved terminal progress bars elided; the transfer covers the 1.61 GB of packages listed above, largest first: nsight-compute 443.1 MB, libcublas 256.2 MB, libcufft 156.2 MB, libcusparse 118.6 MB, cuda-nsight 113.2 MB, cuda-nvvp 109.3 MB, libcusolver 95.8 MB, libnpp 93.4 MB; the section is truncated mid-download) ...
| 443.1 MB | #######1 | 72% 2025-05-07T20:26:12.5382232Z 2025-05-07T20:26:12.5382238Z 2025-05-07T20:26:12.5382245Z 2025-05-07T20:26:12.5382250Z 2025-05-07T20:26:12.5382255Z 2025-05-07T20:26:12.5382262Z 2025-05-07T20:26:12.5382271Z 2025-05-07T20:26:12.6161388Z libnpp-12.3.1.54 | 93.4 MB | ######4 | 65%  2025-05-07T20:26:12.6387442Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:26:12.6387837Z 2025-05-07T20:26:12.6387843Z 2025-05-07T20:26:12.6387848Z 2025-05-07T20:26:12.6387854Z 2025-05-07T20:26:12.6387859Z 2025-05-07T20:26:12.6387865Z 2025-05-07T20:26:12.6390850Z 2025-05-07T20:26:12.7165926Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 69%  2025-05-07T20:26:12.7516177Z nsight-compute-2024. | 443.1 MB | #######3 | 73% 2025-05-07T20:26:12.7516582Z 2025-05-07T20:26:12.7516589Z 2025-05-07T20:26:12.7516594Z 2025-05-07T20:26:12.7516600Z 2025-05-07T20:26:12.7516605Z 2025-05-07T20:26:12.7516611Z 2025-05-07T20:26:12.7516617Z 2025-05-07T20:26:12.8167880Z libnpp-12.3.1.54 | 93.4 MB | #######3 | 73%  2025-05-07T20:26:12.8536304Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:26:12.8536600Z 2025-05-07T20:26:12.8536604Z 2025-05-07T20:26:12.8536607Z 2025-05-07T20:26:12.8536611Z 2025-05-07T20:26:12.8536615Z 2025-05-07T20:26:12.8536619Z 2025-05-07T20:26:12.8538024Z 2025-05-07T20:26:12.9223640Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 77%  2025-05-07T20:26:12.9692048Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:26:12.9692434Z 2025-05-07T20:26:12.9692439Z 2025-05-07T20:26:12.9692442Z 2025-05-07T20:26:12.9692446Z 2025-05-07T20:26:12.9692450Z 2025-05-07T20:26:12.9692454Z 2025-05-07T20:26:12.9693719Z 2025-05-07T20:26:13.0224442Z libnpp-12.3.1.54 | 93.4 MB | ######## | 81%  2025-05-07T20:26:13.0758778Z nsight-compute-2024. | 443.1 MB | #######5 | 76% 2025-05-07T20:26:13.0759063Z 2025-05-07T20:26:13.0759067Z 2025-05-07T20:26:13.0759070Z 2025-05-07T20:26:13.0759074Z 2025-05-07T20:26:13.0759078Z 2025-05-07T20:26:13.0759082Z 2025-05-07T20:26:13.0759086Z 2025-05-07T20:26:13.1226333Z libnpp-12.3.1.54 | 93.4 MB | ########4 | 84%  2025-05-07T20:26:13.1769860Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:26:13.1770281Z 2025-05-07T20:26:13.1770313Z 2025-05-07T20:26:13.1770319Z 2025-05-07T20:26:13.1770324Z 2025-05-07T20:26:13.1770329Z 2025-05-07T20:26:13.1770334Z 2025-05-07T20:26:13.1770382Z 2025-05-07T20:26:13.2247104Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 88%  2025-05-07T20:26:13.2774118Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:26:13.2774393Z 2025-05-07T20:26:13.2774397Z 2025-05-07T20:26:13.2774401Z 2025-05-07T20:26:13.2774405Z 2025-05-07T20:26:13.2774424Z 2025-05-07T20:26:13.2774429Z 2025-05-07T20:26:13.2775125Z 2025-05-07T20:26:13.3250415Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 92%  2025-05-07T20:26:13.3778566Z nsight-compute-2024. | 443.1 MB | #######8 | 78% 2025-05-07T20:26:13.3778879Z 2025-05-07T20:26:13.3778884Z 2025-05-07T20:26:13.3778889Z 2025-05-07T20:26:13.3778894Z 2025-05-07T20:26:13.3778898Z 2025-05-07T20:26:13.3778904Z 2025-05-07T20:26:13.3779474Z 2025-05-07T20:26:13.4357031Z libnpp-12.3.1.54 | 93.4 MB | #########5 | 96%  2025-05-07T20:26:13.4779633Z nsight-compute-2024. 
| 443.1 MB | #######9 | 79% 2025-05-07T20:26:13.4779903Z 2025-05-07T20:26:13.4779908Z 2025-05-07T20:26:13.4779911Z 2025-05-07T20:26:13.4779915Z 2025-05-07T20:26:13.4779919Z 2025-05-07T20:26:13.4779923Z 2025-05-07T20:26:13.4782326Z 2025-05-07T20:26:13.5356975Z libnpp-12.3.1.54 | 93.4 MB | #########9 | 100%  2025-05-07T20:26:13.6728766Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:26:13.8523103Z nsight-compute-2024. | 443.1 MB | ######## | 81% 2025-05-07T20:26:13.9523263Z nsight-compute-2024. | 443.1 MB | ########1 | 82% 2025-05-07T20:26:14.0528367Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:26:14.1531679Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:26:14.2535994Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:26:14.3536108Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:26:14.4575253Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:26:14.5575671Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:26:14.6153183Z nsight-compute-2024. | 443.1 MB | ########8 | 88% 2025-05-07T20:26:14.6153463Z 2025-05-07T20:26:14.6153467Z 2025-05-07T20:26:14.6153471Z 2025-05-07T20:26:14.6153475Z 2025-05-07T20:26:14.6153480Z 2025-05-07T20:26:14.6153485Z 2025-05-07T20:26:14.6586065Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:14.6648453Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:14.6648729Z 2025-05-07T20:26:14.6649071Z 2025-05-07T20:26:14.6649075Z 2025-05-07T20:26:14.6649175Z 2025-05-07T20:26:14.6649218Z 2025-05-07T20:26:14.6649234Z 2025-05-07T20:26:14.6649240Z 2025-05-07T20:26:14.6649415Z 2025-05-07T20:26:14.7652795Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:14.7653228Z 2025-05-07T20:26:14.7653262Z 2025-05-07T20:26:14.7653513Z 2025-05-07T20:26:14.7653517Z 2025-05-07T20:26:14.7653533Z 2025-05-07T20:26:14.7653536Z 2025-05-07T20:26:14.7653540Z 2025-05-07T20:26:14.7657677Z 2025-05-07T20:26:14.7867763Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 6%  2025-05-07T20:26:14.8702539Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:26:14.8702874Z 2025-05-07T20:26:14.8703108Z 2025-05-07T20:26:14.8703118Z 2025-05-07T20:26:14.8703124Z 2025-05-07T20:26:14.8703129Z 2025-05-07T20:26:14.8703135Z 2025-05-07T20:26:14.8703141Z 2025-05-07T20:26:14.8709021Z 2025-05-07T20:26:14.9021868Z cuda-nvdisasm-12.6.7 | 47.6 MB | #2 | 12%  2025-05-07T20:26:14.9707903Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:14.9708325Z 2025-05-07T20:26:14.9708332Z 2025-05-07T20:26:14.9708337Z 2025-05-07T20:26:14.9708342Z 2025-05-07T20:26:14.9708348Z 2025-05-07T20:26:14.9708354Z 2025-05-07T20:26:14.9708359Z 2025-05-07T20:26:14.9714711Z 2025-05-07T20:26:15.0089681Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 19%  2025-05-07T20:26:15.0793370Z nsight-compute-2024. 
| 443.1 MB | #########1 | 91% 2025-05-07T20:26:15.0793816Z 2025-05-07T20:26:15.0793822Z 2025-05-07T20:26:15.0793827Z 2025-05-07T20:26:15.0793844Z 2025-05-07T20:26:15.0793849Z 2025-05-07T20:26:15.0793854Z 2025-05-07T20:26:15.0793860Z 2025-05-07T20:26:15.0798874Z 2025-05-07T20:26:15.0852932Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##5 | 25%  2025-05-07T20:26:15.0853343Z 2025-05-07T20:26:15.0853348Z 2025-05-07T20:26:15.0853353Z 2025-05-07T20:26:15.0853358Z 2025-05-07T20:26:15.0858165Z 2025-05-07T20:26:15.1387822Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:15.1388126Z 2025-05-07T20:26:15.1388130Z 2025-05-07T20:26:15.1388134Z 2025-05-07T20:26:15.1388138Z 2025-05-07T20:26:15.1388142Z 2025-05-07T20:26:15.1388146Z 2025-05-07T20:26:15.1388150Z 2025-05-07T20:26:15.1388398Z 2025-05-07T20:26:15.1388585Z 2025-05-07T20:26:15.1403155Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:15.1793413Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:26:15.1793707Z 2025-05-07T20:26:15.1793711Z 2025-05-07T20:26:15.1793715Z 2025-05-07T20:26:15.1793718Z 2025-05-07T20:26:15.1793722Z 2025-05-07T20:26:15.1793726Z 2025-05-07T20:26:15.1793729Z 2025-05-07T20:26:15.1793733Z 2025-05-07T20:26:15.2390682Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###2 | 33%  2025-05-07T20:26:15.2391016Z 2025-05-07T20:26:15.2391020Z 2025-05-07T20:26:15.2391024Z 2025-05-07T20:26:15.2391028Z 2025-05-07T20:26:15.2391033Z 2025-05-07T20:26:15.2391037Z 2025-05-07T20:26:15.2391042Z 2025-05-07T20:26:15.2391046Z 2025-05-07T20:26:15.2398835Z 2025-05-07T20:26:15.2408377Z libcurand-10.3.7.77 | 39.9 MB | 3 | 4%  2025-05-07T20:26:15.2797187Z nsight-compute-2024. | 443.1 MB | #########2 | 93% 2025-05-07T20:26:15.2797484Z 2025-05-07T20:26:15.2797488Z 2025-05-07T20:26:15.2797491Z 2025-05-07T20:26:15.2797495Z 2025-05-07T20:26:15.2797499Z 2025-05-07T20:26:15.2797503Z 2025-05-07T20:26:15.2797506Z 2025-05-07T20:26:15.2797510Z 2025-05-07T20:26:15.3404480Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###9 | 40%  2025-05-07T20:26:15.3404943Z 2025-05-07T20:26:15.3404952Z 2025-05-07T20:26:15.3404960Z 2025-05-07T20:26:15.3404968Z 2025-05-07T20:26:15.3404976Z 2025-05-07T20:26:15.3404984Z 2025-05-07T20:26:15.3404992Z 2025-05-07T20:26:15.3405000Z 2025-05-07T20:26:15.3407810Z 2025-05-07T20:26:15.3496186Z libcurand-10.3.7.77 | 39.9 MB | 9 | 10%  2025-05-07T20:26:15.3984278Z nsight-compute-2024. | 443.1 MB | #########3 | 94% 2025-05-07T20:26:15.3984555Z 2025-05-07T20:26:15.3984560Z 2025-05-07T20:26:15.3984565Z 2025-05-07T20:26:15.3984568Z 2025-05-07T20:26:15.3984573Z 2025-05-07T20:26:15.3984578Z 2025-05-07T20:26:15.3984891Z 2025-05-07T20:26:15.3984898Z 2025-05-07T20:26:15.4410918Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####6 | 46%  2025-05-07T20:26:15.4411236Z 2025-05-07T20:26:15.4411240Z 2025-05-07T20:26:15.4411244Z 2025-05-07T20:26:15.4411248Z 2025-05-07T20:26:15.4411251Z 2025-05-07T20:26:15.4411255Z 2025-05-07T20:26:15.4411259Z 2025-05-07T20:26:15.4411263Z 2025-05-07T20:26:15.4411273Z 2025-05-07T20:26:15.4615953Z libcurand-10.3.7.77 | 39.9 MB | #6 | 16%  2025-05-07T20:26:15.5142070Z nsight-compute-2024. 
| 443.1 MB | #########4 | 94% 2025-05-07T20:26:15.5142341Z 2025-05-07T20:26:15.5142621Z 2025-05-07T20:26:15.5142633Z 2025-05-07T20:26:15.5142687Z 2025-05-07T20:26:15.5142692Z 2025-05-07T20:26:15.5142697Z 2025-05-07T20:26:15.5142703Z 2025-05-07T20:26:15.5144209Z 2025-05-07T20:26:15.5411234Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####2 | 53%  2025-05-07T20:26:15.5411625Z 2025-05-07T20:26:15.5411652Z 2025-05-07T20:26:15.5411672Z 2025-05-07T20:26:15.5411677Z 2025-05-07T20:26:15.5411683Z 2025-05-07T20:26:15.5411696Z 2025-05-07T20:26:15.5411701Z 2025-05-07T20:26:15.5411705Z 2025-05-07T20:26:15.5413724Z 2025-05-07T20:26:15.5696103Z libcurand-10.3.7.77 | 39.9 MB | ##3 | 23%  2025-05-07T20:26:15.6238065Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:26:15.6238332Z 2025-05-07T20:26:15.6238337Z 2025-05-07T20:26:15.6238341Z 2025-05-07T20:26:15.6238347Z 2025-05-07T20:26:15.6238350Z 2025-05-07T20:26:15.6238354Z 2025-05-07T20:26:15.6238366Z 2025-05-07T20:26:15.6239873Z 2025-05-07T20:26:15.6416491Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####8 | 59%  2025-05-07T20:26:15.6416811Z 2025-05-07T20:26:15.6416816Z 2025-05-07T20:26:15.6416827Z 2025-05-07T20:26:15.6416831Z 2025-05-07T20:26:15.6416834Z 2025-05-07T20:26:15.6416839Z 2025-05-07T20:26:15.6416842Z 2025-05-07T20:26:15.6416847Z 2025-05-07T20:26:15.6420142Z 2025-05-07T20:26:15.6699766Z libcurand-10.3.7.77 | 39.9 MB | ##9 | 30%  2025-05-07T20:26:15.7245636Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:26:15.7245932Z 2025-05-07T20:26:15.7245936Z 2025-05-07T20:26:15.7245940Z 2025-05-07T20:26:15.7245944Z 2025-05-07T20:26:15.7245947Z 2025-05-07T20:26:15.7245951Z 2025-05-07T20:26:15.7245955Z 2025-05-07T20:26:15.7251826Z 2025-05-07T20:26:15.7422440Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######4 | 65%  2025-05-07T20:26:15.7422842Z 2025-05-07T20:26:15.7422848Z 2025-05-07T20:26:15.7422853Z 2025-05-07T20:26:15.7422858Z 2025-05-07T20:26:15.7422863Z 2025-05-07T20:26:15.7422868Z 2025-05-07T20:26:15.7422875Z 2025-05-07T20:26:15.7422881Z 2025-05-07T20:26:15.7424377Z 2025-05-07T20:26:15.7753840Z libcurand-10.3.7.77 | 39.9 MB | ###6 | 36%  2025-05-07T20:26:15.8246927Z nsight-compute-2024. | 443.1 MB | #########6 | 96% 2025-05-07T20:26:15.8247333Z 2025-05-07T20:26:15.8247354Z 2025-05-07T20:26:15.8247359Z 2025-05-07T20:26:15.8247364Z 2025-05-07T20:26:15.8247369Z 2025-05-07T20:26:15.8247375Z 2025-05-07T20:26:15.8247380Z 2025-05-07T20:26:15.8250191Z 2025-05-07T20:26:15.8428072Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######1 | 71%  2025-05-07T20:26:15.8428522Z 2025-05-07T20:26:15.8428527Z 2025-05-07T20:26:15.8428532Z 2025-05-07T20:26:15.8428537Z 2025-05-07T20:26:15.8428543Z 2025-05-07T20:26:15.8428548Z 2025-05-07T20:26:15.8428564Z 2025-05-07T20:26:15.8428571Z 2025-05-07T20:26:15.8429917Z 2025-05-07T20:26:15.8757497Z libcurand-10.3.7.77 | 39.9 MB | ####3 | 43%  2025-05-07T20:26:15.9379979Z nsight-compute-2024. 
| 443.1 MB | #########7 | 97% 2025-05-07T20:26:15.9380433Z 2025-05-07T20:26:15.9380440Z 2025-05-07T20:26:15.9380446Z 2025-05-07T20:26:15.9380451Z 2025-05-07T20:26:15.9380458Z 2025-05-07T20:26:15.9380464Z 2025-05-07T20:26:15.9380470Z 2025-05-07T20:26:15.9385263Z 2025-05-07T20:26:15.9430124Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######7 | 77%  2025-05-07T20:26:15.9430513Z 2025-05-07T20:26:15.9430518Z 2025-05-07T20:26:15.9430521Z 2025-05-07T20:26:15.9430525Z 2025-05-07T20:26:15.9430529Z 2025-05-07T20:26:15.9430541Z 2025-05-07T20:26:15.9430545Z 2025-05-07T20:26:15.9430548Z 2025-05-07T20:26:15.9431573Z 2025-05-07T20:26:15.9776892Z libcurand-10.3.7.77 | 39.9 MB | ##### | 51%  2025-05-07T20:26:16.0389823Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:26:16.0390169Z 2025-05-07T20:26:16.0390176Z 2025-05-07T20:26:16.0390181Z 2025-05-07T20:26:16.0390187Z 2025-05-07T20:26:16.0390192Z 2025-05-07T20:26:16.0390209Z 2025-05-07T20:26:16.0390215Z 2025-05-07T20:26:16.0391763Z 2025-05-07T20:26:16.0432676Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########4 | 84%  2025-05-07T20:26:16.0433136Z 2025-05-07T20:26:16.0433141Z 2025-05-07T20:26:16.0433145Z 2025-05-07T20:26:16.0433167Z 2025-05-07T20:26:16.0433183Z 2025-05-07T20:26:16.0433186Z 2025-05-07T20:26:16.0433190Z 2025-05-07T20:26:16.0433193Z 2025-05-07T20:26:16.0433197Z 2025-05-07T20:26:16.0784085Z libcurand-10.3.7.77 | 39.9 MB | #####7 | 58%  2025-05-07T20:26:16.1434665Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:26:16.1434940Z 2025-05-07T20:26:16.1435333Z 2025-05-07T20:26:16.1435339Z 2025-05-07T20:26:16.1435357Z 2025-05-07T20:26:16.1435360Z 2025-05-07T20:26:16.1435364Z 2025-05-07T20:26:16.1435368Z 2025-05-07T20:26:16.1435371Z 2025-05-07T20:26:16.1436924Z 2025-05-07T20:26:16.1457284Z libcurand-10.3.7.77 | 39.9 MB | ######5 | 65%  2025-05-07T20:26:16.1457603Z 2025-05-07T20:26:16.1457608Z 2025-05-07T20:26:16.1457612Z 2025-05-07T20:26:16.1457616Z 2025-05-07T20:26:16.1457620Z 2025-05-07T20:26:16.1457625Z 2025-05-07T20:26:16.1457630Z 2025-05-07T20:26:16.1461957Z 2025-05-07T20:26:16.1848065Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######### | 90%  2025-05-07T20:26:16.2436758Z nsight-compute-2024. 
| 443.1 MB | #########9 | 99% 2025-05-07T20:26:16.2437096Z 2025-05-07T20:26:16.2437102Z 2025-05-07T20:26:16.2437119Z 2025-05-07T20:26:16.2437124Z 2025-05-07T20:26:16.2437129Z 2025-05-07T20:26:16.2437135Z 2025-05-07T20:26:16.2437140Z 2025-05-07T20:26:16.2437145Z 2025-05-07T20:26:16.2442205Z 2025-05-07T20:26:16.2484418Z libcurand-10.3.7.77 | 39.9 MB | #######2 | 73%  2025-05-07T20:26:16.2484874Z 2025-05-07T20:26:16.2484880Z 2025-05-07T20:26:16.2484885Z 2025-05-07T20:26:16.2484891Z 2025-05-07T20:26:16.2484897Z 2025-05-07T20:26:16.2484902Z 2025-05-07T20:26:16.2484908Z 2025-05-07T20:26:16.2484913Z 2025-05-07T20:26:16.3438484Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########6 | 96%  2025-05-07T20:26:16.3438858Z 2025-05-07T20:26:16.3438862Z 2025-05-07T20:26:16.3438866Z 2025-05-07T20:26:16.3438870Z 2025-05-07T20:26:16.3438899Z 2025-05-07T20:26:16.3438918Z 2025-05-07T20:26:16.3438921Z 2025-05-07T20:26:16.3438925Z 2025-05-07T20:26:16.3440931Z 2025-05-07T20:26:16.3723930Z libcurand-10.3.7.77 | 39.9 MB | ######## | 81%  2025-05-07T20:26:16.3724241Z 2025-05-07T20:26:16.3724245Z 2025-05-07T20:26:16.3724249Z 2025-05-07T20:26:16.3724252Z 2025-05-07T20:26:16.3724256Z 2025-05-07T20:26:16.3724260Z 2025-05-07T20:26:16.3726547Z 2025-05-07T20:26:16.4159886Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%  2025-05-07T20:26:16.4160180Z 2025-05-07T20:26:16.4160184Z 2025-05-07T20:26:16.4160188Z 2025-05-07T20:26:16.4160200Z 2025-05-07T20:26:16.4160204Z 2025-05-07T20:26:16.4160208Z 2025-05-07T20:26:16.4160211Z 2025-05-07T20:26:16.4160215Z 2025-05-07T20:26:16.4160219Z 2025-05-07T20:26:16.4160389Z 2025-05-07T20:26:16.4439256Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:16.4439615Z 2025-05-07T20:26:16.4439620Z 2025-05-07T20:26:16.4439905Z 2025-05-07T20:26:16.4439911Z 2025-05-07T20:26:16.4439916Z 2025-05-07T20:26:16.4439921Z 2025-05-07T20:26:16.4439926Z 2025-05-07T20:26:16.4439931Z 2025-05-07T20:26:16.4442685Z 2025-05-07T20:26:16.5167192Z libcurand-10.3.7.77 | 39.9 MB | ########9 | 90%  2025-05-07T20:26:16.5167521Z 2025-05-07T20:26:16.5167525Z 2025-05-07T20:26:16.5167529Z 2025-05-07T20:26:16.5167533Z 2025-05-07T20:26:16.5167537Z 2025-05-07T20:26:16.5167541Z 2025-05-07T20:26:16.5167544Z 2025-05-07T20:26:16.5167548Z 2025-05-07T20:26:16.5167552Z 2025-05-07T20:26:16.5167775Z 2025-05-07T20:26:16.5439495Z gds-tools-1.11.1.6 | 37.8 MB | 8 | 9%  2025-05-07T20:26:16.5439804Z 2025-05-07T20:26:16.5439809Z 2025-05-07T20:26:16.5439813Z 2025-05-07T20:26:16.5439818Z 2025-05-07T20:26:16.5439822Z 2025-05-07T20:26:16.5439826Z 2025-05-07T20:26:16.5439830Z 2025-05-07T20:26:16.5439833Z 2025-05-07T20:26:16.5443277Z 2025-05-07T20:26:16.6167970Z libcurand-10.3.7.77 | 39.9 MB | #########8 | 98%  2025-05-07T20:26:16.6168297Z 2025-05-07T20:26:16.6168301Z 2025-05-07T20:26:16.6168313Z 2025-05-07T20:26:16.6168317Z 2025-05-07T20:26:16.6168320Z 2025-05-07T20:26:16.6168324Z 2025-05-07T20:26:16.6168327Z 2025-05-07T20:26:16.6168331Z 2025-05-07T20:26:16.6168335Z 2025-05-07T20:26:16.6168737Z 2025-05-07T20:26:16.7168669Z gds-tools-1.11.1.6 | 37.8 MB | #8 | 19%  2025-05-07T20:26:16.7169111Z 2025-05-07T20:26:16.7169117Z 2025-05-07T20:26:16.7169122Z 2025-05-07T20:26:16.7169127Z 2025-05-07T20:26:16.7169133Z 2025-05-07T20:26:16.7169138Z 2025-05-07T20:26:16.7169143Z 2025-05-07T20:26:16.7169148Z 2025-05-07T20:26:16.7169154Z 2025-05-07T20:26:16.7171474Z 2025-05-07T20:26:16.8174375Z gds-tools-1.11.1.6 | 37.8 MB | ##9 | 29%  2025-05-07T20:26:16.8174780Z 2025-05-07T20:26:16.8174785Z 2025-05-07T20:26:16.8174790Z 2025-05-07T20:26:16.8175060Z 
2025-05-07T20:26:16.8175080Z 2025-05-07T20:26:16.8175084Z 2025-05-07T20:26:16.8175087Z 2025-05-07T20:26:16.8175091Z 2025-05-07T20:26:16.8175095Z 2025-05-07T20:26:16.8176809Z 2025-05-07T20:26:16.8316397Z gds-tools-1.11.1.6 | 37.8 MB | ###9 | 40%  2025-05-07T20:26:16.8316740Z 2025-05-07T20:26:16.8316744Z 2025-05-07T20:26:16.8316747Z 2025-05-07T20:26:16.9174997Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:16.9175388Z 2025-05-07T20:26:16.9175392Z 2025-05-07T20:26:16.9175396Z 2025-05-07T20:26:16.9175400Z 2025-05-07T20:26:16.9175404Z 2025-05-07T20:26:16.9175407Z 2025-05-07T20:26:16.9175411Z 2025-05-07T20:26:16.9175424Z 2025-05-07T20:26:16.9175427Z 2025-05-07T20:26:16.9175431Z 2025-05-07T20:26:17.0230982Z gds-tools-1.11.1.6 | 37.8 MB | ####9 | 50%  2025-05-07T20:26:17.0231300Z 2025-05-07T20:26:17.0231312Z 2025-05-07T20:26:17.0231315Z 2025-05-07T20:26:17.0231319Z 2025-05-07T20:26:17.0231344Z 2025-05-07T20:26:17.0231361Z 2025-05-07T20:26:17.0231364Z 2025-05-07T20:26:17.0231368Z 2025-05-07T20:26:17.0231371Z 2025-05-07T20:26:17.0231375Z 2025-05-07T20:26:17.1235000Z gds-tools-1.11.1.6 | 37.8 MB | ###### | 60%  2025-05-07T20:26:17.1235316Z 2025-05-07T20:26:17.1235320Z 2025-05-07T20:26:17.1235332Z 2025-05-07T20:26:17.1235336Z 2025-05-07T20:26:17.1235339Z 2025-05-07T20:26:17.1235343Z 2025-05-07T20:26:17.1235347Z 2025-05-07T20:26:17.1235350Z 2025-05-07T20:26:17.1235354Z 2025-05-07T20:26:17.1235357Z 2025-05-07T20:26:17.2274379Z gds-tools-1.11.1.6 | 37.8 MB | ####### | 71%  2025-05-07T20:26:17.2274680Z 2025-05-07T20:26:17.2274695Z 2025-05-07T20:26:17.2274699Z 2025-05-07T20:26:17.2274702Z 2025-05-07T20:26:17.2274706Z 2025-05-07T20:26:17.2274710Z 2025-05-07T20:26:17.2274714Z 2025-05-07T20:26:17.2274718Z 2025-05-07T20:26:17.2274722Z 2025-05-07T20:26:17.2276105Z 2025-05-07T20:26:17.3795950Z gds-tools-1.11.1.6 | 37.8 MB | ######## | 81%  2025-05-07T20:26:17.3796876Z 2025-05-07T20:26:17.4190352Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%  2025-05-07T20:26:17.4190653Z 2025-05-07T20:26:17.4190657Z 2025-05-07T20:26:17.4190661Z 2025-05-07T20:26:17.4190665Z 2025-05-07T20:26:17.4190669Z 2025-05-07T20:26:17.4190672Z 2025-05-07T20:26:17.4190676Z 2025-05-07T20:26:17.4190680Z 2025-05-07T20:26:17.4190684Z 2025-05-07T20:26:17.4190692Z 2025-05-07T20:26:17.4305541Z gds-tools-1.11.1.6 | 37.8 MB | #########1 | 91%  2025-05-07T20:26:17.4305978Z 2025-05-07T20:26:17.4305984Z 2025-05-07T20:26:17.4305989Z 2025-05-07T20:26:17.4305994Z 2025-05-07T20:26:17.4305999Z 2025-05-07T20:26:17.4306005Z 2025-05-07T20:26:17.4306010Z 2025-05-07T20:26:17.4306016Z 2025-05-07T20:26:17.4306021Z 2025-05-07T20:26:17.4306026Z 2025-05-07T20:26:17.4309796Z 2025-05-07T20:26:17.5197266Z cuda-nvcc-tools-12.6 | 23.0 MB | | 0%  2025-05-07T20:26:17.5197729Z 2025-05-07T20:26:17.5197733Z 2025-05-07T20:26:17.5197737Z 2025-05-07T20:26:17.5197741Z 2025-05-07T20:26:17.5197744Z 2025-05-07T20:26:17.5197748Z 2025-05-07T20:26:17.5197752Z 2025-05-07T20:26:17.5197755Z 2025-05-07T20:26:17.5197759Z 2025-05-07T20:26:17.5203612Z 2025-05-07T20:26:17.5308956Z gds-tools-1.11.1.6 | 37.8 MB | #########9 | 100%  2025-05-07T20:26:17.5309388Z 2025-05-07T20:26:17.5309392Z 2025-05-07T20:26:17.5309396Z 2025-05-07T20:26:17.5309400Z 2025-05-07T20:26:17.5309403Z 2025-05-07T20:26:17.5309407Z 2025-05-07T20:26:17.5309411Z 2025-05-07T20:26:17.5309414Z 2025-05-07T20:26:17.5309418Z 2025-05-07T20:26:17.5309422Z 2025-05-07T20:26:17.5309426Z 2025-05-07T20:26:17.6311817Z cuda-nvcc-tools-12.6 | 23.0 MB | #4 | 15%  2025-05-07T20:26:17.6312156Z 2025-05-07T20:26:17.6312160Z 
2025-05-07T20:26:17.6312164Z 2025-05-07T20:26:17.6312168Z 2025-05-07T20:26:17.6312443Z 2025-05-07T20:26:17.6312459Z 2025-05-07T20:26:17.6312463Z 2025-05-07T20:26:17.6312468Z 2025-05-07T20:26:17.6312471Z 2025-05-07T20:26:17.6312475Z 2025-05-07T20:26:17.6312530Z 2025-05-07T20:26:17.7338053Z cuda-nvcc-tools-12.6 | 23.0 MB | ###2 | 33%  2025-05-07T20:26:17.7338398Z 2025-05-07T20:26:17.7338402Z 2025-05-07T20:26:17.7338415Z 2025-05-07T20:26:17.7338419Z 2025-05-07T20:26:17.7338423Z 2025-05-07T20:26:17.7338426Z 2025-05-07T20:26:17.7338430Z 2025-05-07T20:26:17.7338434Z 2025-05-07T20:26:17.7338438Z 2025-05-07T20:26:17.7338442Z 2025-05-07T20:26:17.7339002Z 2025-05-07T20:26:17.8339345Z cuda-nvcc-tools-12.6 | 23.0 MB | ####9 | 50%  2025-05-07T20:26:17.8339793Z 2025-05-07T20:26:17.8339799Z 2025-05-07T20:26:17.8339804Z 2025-05-07T20:26:17.8339809Z 2025-05-07T20:26:17.8339815Z 2025-05-07T20:26:17.8339822Z 2025-05-07T20:26:17.8339829Z 2025-05-07T20:26:17.8339833Z 2025-05-07T20:26:17.8339866Z 2025-05-07T20:26:17.8339885Z 2025-05-07T20:26:17.8339890Z 2025-05-07T20:26:17.8661125Z cuda-nvcc-tools-12.6 | 23.0 MB | ######7 | 67%  2025-05-07T20:26:17.8661463Z 2025-05-07T20:26:17.8661467Z 2025-05-07T20:26:17.8661471Z 2025-05-07T20:26:17.8661475Z 2025-05-07T20:26:17.8661479Z 2025-05-07T20:26:17.8661482Z 2025-05-07T20:26:17.8661486Z 2025-05-07T20:26:17.8661490Z 2025-05-07T20:26:17.8662943Z 2025-05-07T20:26:17.8912891Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%  2025-05-07T20:26:17.8913207Z 2025-05-07T20:26:17.8913211Z 2025-05-07T20:26:17.8913215Z 2025-05-07T20:26:17.8913219Z 2025-05-07T20:26:17.8913222Z 2025-05-07T20:26:17.8913226Z 2025-05-07T20:26:17.8913239Z 2025-05-07T20:26:17.8913242Z 2025-05-07T20:26:17.9315941Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%  2025-05-07T20:26:17.9316384Z 2025-05-07T20:26:17.9316390Z 2025-05-07T20:26:17.9316407Z 2025-05-07T20:26:17.9316442Z 2025-05-07T20:26:17.9316696Z 2025-05-07T20:26:17.9316700Z 2025-05-07T20:26:17.9316703Z 2025-05-07T20:26:17.9316707Z 2025-05-07T20:26:17.9316711Z 2025-05-07T20:26:17.9316714Z 2025-05-07T20:26:17.9316718Z 2025-05-07T20:26:17.9320140Z 2025-05-07T20:26:17.9338939Z cuda-nvrtc-12.6.85 | 17.3 MB | | 0%  2025-05-07T20:26:17.9339367Z 2025-05-07T20:26:17.9339371Z 2025-05-07T20:26:17.9339375Z 2025-05-07T20:26:17.9339379Z 2025-05-07T20:26:17.9339383Z 2025-05-07T20:26:17.9339386Z 2025-05-07T20:26:17.9339390Z 2025-05-07T20:26:17.9339394Z 2025-05-07T20:26:17.9339398Z 2025-05-07T20:26:17.9339402Z 2025-05-07T20:26:17.9340100Z 2025-05-07T20:26:17.9517025Z cuda-nvcc-tools-12.6 | 23.0 MB | ########7 | 87%  2025-05-07T20:26:17.9517351Z 2025-05-07T20:26:17.9517355Z 2025-05-07T20:26:17.9517359Z 2025-05-07T20:26:17.9517363Z 2025-05-07T20:26:17.9517367Z 2025-05-07T20:26:17.9517370Z 2025-05-07T20:26:17.9517393Z 2025-05-07T20:26:17.9517404Z 2025-05-07T20:26:17.9517408Z 2025-05-07T20:26:17.9517418Z 2025-05-07T20:26:17.9517422Z 2025-05-07T20:26:17.9517425Z 2025-05-07T20:26:17.9520545Z 2025-05-07T20:26:18.0317518Z libnvjitlink-12.6.85 | 14.9 MB | | 0%  2025-05-07T20:26:18.0317960Z 2025-05-07T20:26:18.0317966Z 2025-05-07T20:26:18.0317971Z 2025-05-07T20:26:18.0317976Z 2025-05-07T20:26:18.0317991Z 2025-05-07T20:26:18.0317996Z 2025-05-07T20:26:18.0318001Z 2025-05-07T20:26:18.0318006Z 2025-05-07T20:26:18.0318011Z 2025-05-07T20:26:18.0318016Z 2025-05-07T20:26:18.0318022Z 2025-05-07T20:26:18.0319642Z 2025-05-07T20:26:18.0520469Z cuda-nvrtc-12.6.85 | 17.3 MB | #8 | 19%  2025-05-07T20:26:18.0520873Z 2025-05-07T20:26:18.0520879Z 2025-05-07T20:26:18.0520884Z 
2025-05-07T20:26:18.0520889Z 2025-05-07T20:26:18.0520895Z 2025-05-07T20:26:18.0520900Z 2025-05-07T20:26:18.0520905Z 2025-05-07T20:26:18.0520913Z 2025-05-07T20:26:18.0521193Z 2025-05-07T20:26:18.0521201Z 2025-05-07T20:26:18.0521206Z 2025-05-07T20:26:18.0521211Z 2025-05-07T20:26:18.0523126Z 2025-05-07T20:26:18.1318232Z libnvjitlink-12.6.85 | 14.9 MB | #9 | 19%  2025-05-07T20:26:18.1318609Z 2025-05-07T20:26:18.1318617Z 2025-05-07T20:26:18.1318622Z 2025-05-07T20:26:18.1318628Z 2025-05-07T20:26:18.1318635Z 2025-05-07T20:26:18.1318641Z 2025-05-07T20:26:18.1318647Z 2025-05-07T20:26:18.1318653Z 2025-05-07T20:26:18.1318658Z 2025-05-07T20:26:18.1318663Z 2025-05-07T20:26:18.1318668Z 2025-05-07T20:26:18.1321534Z 2025-05-07T20:26:18.1521182Z cuda-nvrtc-12.6.85 | 17.3 MB | ###8 | 39%  2025-05-07T20:26:18.1521649Z 2025-05-07T20:26:18.1521655Z 2025-05-07T20:26:18.1521660Z 2025-05-07T20:26:18.1521666Z 2025-05-07T20:26:18.1521671Z 2025-05-07T20:26:18.1521676Z 2025-05-07T20:26:18.1521680Z 2025-05-07T20:26:18.1521684Z 2025-05-07T20:26:18.1521688Z 2025-05-07T20:26:18.1521745Z 2025-05-07T20:26:18.1521750Z 2025-05-07T20:26:18.1521756Z 2025-05-07T20:26:18.1521762Z 2025-05-07T20:26:18.2320214Z libnvjitlink-12.6.85 | 14.9 MB | ####1 | 41%  2025-05-07T20:26:18.2320574Z 2025-05-07T20:26:18.2320578Z 2025-05-07T20:26:18.2320582Z 2025-05-07T20:26:18.2320585Z 2025-05-07T20:26:18.2320589Z 2025-05-07T20:26:18.2320593Z 2025-05-07T20:26:18.2320597Z 2025-05-07T20:26:18.2320601Z 2025-05-07T20:26:18.2320605Z 2025-05-07T20:26:18.2320609Z 2025-05-07T20:26:18.2320613Z 2025-05-07T20:26:18.2322117Z 2025-05-07T20:26:18.2559917Z cuda-nvrtc-12.6.85 | 17.3 MB | #####9 | 60%  2025-05-07T20:26:18.2560242Z 2025-05-07T20:26:18.2560248Z 2025-05-07T20:26:18.2560254Z 2025-05-07T20:26:18.2560259Z 2025-05-07T20:26:18.2560264Z 2025-05-07T20:26:18.2560270Z 2025-05-07T20:26:18.2560274Z 2025-05-07T20:26:18.2560279Z 2025-05-07T20:26:18.2560284Z 2025-05-07T20:26:18.2560288Z 2025-05-07T20:26:18.2560326Z 2025-05-07T20:26:18.2560622Z 2025-05-07T20:26:18.2560645Z 2025-05-07T20:26:18.2590902Z libnvjitlink-12.6.85 | 14.9 MB | ######2 | 62%  2025-05-07T20:26:18.2591219Z 2025-05-07T20:26:18.2591224Z 2025-05-07T20:26:18.3425044Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:18.3425311Z 2025-05-07T20:26:18.3425547Z 2025-05-07T20:26:18.3425559Z 2025-05-07T20:26:18.3429480Z 2025-05-07T20:26:18.3429486Z 2025-05-07T20:26:18.3429492Z 2025-05-07T20:26:18.3429498Z 2025-05-07T20:26:18.3429504Z 2025-05-07T20:26:18.3429554Z 2025-05-07T20:26:18.3429560Z 2025-05-07T20:26:18.3429566Z 2025-05-07T20:26:18.3429571Z 2025-05-07T20:26:18.3600729Z cuda-nvrtc-12.6.85 | 17.3 MB | ######## | 80%  2025-05-07T20:26:18.3601105Z 2025-05-07T20:26:18.3601111Z 2025-05-07T20:26:18.3601117Z 2025-05-07T20:26:18.3601124Z 2025-05-07T20:26:18.3601130Z 2025-05-07T20:26:18.3601135Z 2025-05-07T20:26:18.3601178Z 2025-05-07T20:26:18.3601206Z 2025-05-07T20:26:18.3601211Z 2025-05-07T20:26:18.3601217Z 2025-05-07T20:26:18.3601223Z 2025-05-07T20:26:18.3601227Z 2025-05-07T20:26:18.3601232Z 2025-05-07T20:26:18.4443786Z libnvjitlink-12.6.85 | 14.9 MB | ########2 | 83%  2025-05-07T20:26:18.4444147Z 2025-05-07T20:26:18.4444151Z 2025-05-07T20:26:18.4444156Z 2025-05-07T20:26:18.4444159Z 2025-05-07T20:26:18.4444163Z 2025-05-07T20:26:18.4444168Z 2025-05-07T20:26:18.4444171Z 2025-05-07T20:26:18.4444175Z 2025-05-07T20:26:18.4444179Z 2025-05-07T20:26:18.4444182Z 2025-05-07T20:26:18.4444186Z 2025-05-07T20:26:18.4444419Z 2025-05-07T20:26:18.7513975Z cuda-nvrtc-12.6.85 | 17.3 MB | #########9 | 100%  
2025-05-07T20:26:18.7514412Z 2025-05-07T20:26:18.7514418Z 2025-05-07T20:26:18.7514423Z 2025-05-07T20:26:18.7514428Z 2025-05-07T20:26:18.7514433Z 2025-05-07T20:26:18.7514450Z 2025-05-07T20:26:18.7514457Z 2025-05-07T20:26:18.7514774Z 2025-05-07T20:26:18.7514796Z 2025-05-07T20:26:18.7514802Z 2025-05-07T20:26:18.7517555Z 2025-05-07T20:26:18.7942126Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%  2025-05-07T20:26:18.7942463Z 2025-05-07T20:26:18.7942468Z 2025-05-07T20:26:18.7942474Z 2025-05-07T20:26:18.7942480Z 2025-05-07T20:26:18.7942485Z 2025-05-07T20:26:18.7942495Z 2025-05-07T20:26:18.7942503Z 2025-05-07T20:26:18.7942512Z 2025-05-07T20:26:18.7942520Z 2025-05-07T20:26:18.7942528Z 2025-05-07T20:26:18.8063053Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%  2025-05-07T20:26:18.8063377Z 2025-05-07T20:26:18.8063383Z 2025-05-07T20:26:18.8063389Z 2025-05-07T20:26:18.8063394Z 2025-05-07T20:26:18.8063408Z 2025-05-07T20:26:18.8063412Z 2025-05-07T20:26:18.8063418Z 2025-05-07T20:26:18.8063423Z 2025-05-07T20:26:18.8063428Z 2025-05-07T20:26:18.8063434Z 2025-05-07T20:26:18.8063438Z 2025-05-07T20:26:18.8063443Z 2025-05-07T20:26:18.8063475Z 2025-05-07T20:26:18.8063494Z 2025-05-07T20:26:18.8430527Z cuda-nvcc-dev_linux- | 10.8 MB | | 0%  2025-05-07T20:26:18.8430990Z 2025-05-07T20:26:18.8430997Z 2025-05-07T20:26:18.8431002Z 2025-05-07T20:26:18.8431008Z 2025-05-07T20:26:18.8431014Z 2025-05-07T20:26:18.8431019Z 2025-05-07T20:26:18.8431025Z 2025-05-07T20:26:18.8431031Z 2025-05-07T20:26:18.8431036Z 2025-05-07T20:26:18.8431042Z 2025-05-07T20:26:18.8431047Z 2025-05-07T20:26:18.8431052Z 2025-05-07T20:26:18.8431057Z 2025-05-07T20:26:18.8431060Z 2025-05-07T20:26:18.8436572Z 2025-05-07T20:26:18.8969699Z cuda-nvvm-tools-12.6 | 10.4 MB | | 0%  2025-05-07T20:26:18.8970035Z 2025-05-07T20:26:18.8970039Z 2025-05-07T20:26:18.8970043Z 2025-05-07T20:26:18.8970047Z 2025-05-07T20:26:18.8970051Z 2025-05-07T20:26:18.8970055Z 2025-05-07T20:26:18.8970058Z 2025-05-07T20:26:18.8970062Z 2025-05-07T20:26:18.8970066Z 2025-05-07T20:26:18.8970103Z 2025-05-07T20:26:18.8970641Z 2025-05-07T20:26:18.8970645Z 2025-05-07T20:26:18.8973614Z 2025-05-07T20:26:18.9068649Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%  2025-05-07T20:26:18.9069130Z 2025-05-07T20:26:18.9069134Z 2025-05-07T20:26:18.9069138Z 2025-05-07T20:26:18.9069141Z 2025-05-07T20:26:18.9069145Z 2025-05-07T20:26:18.9069149Z 2025-05-07T20:26:18.9069153Z 2025-05-07T20:26:18.9069156Z 2025-05-07T20:26:18.9069160Z 2025-05-07T20:26:18.9069164Z 2025-05-07T20:26:18.9069167Z 2025-05-07T20:26:18.9069171Z 2025-05-07T20:26:18.9069175Z 2025-05-07T20:26:18.9069179Z 2025-05-07T20:26:18.9308503Z cuda-nvcc-dev_linux- | 10.8 MB | ##6 | 26%  2025-05-07T20:26:18.9308845Z 2025-05-07T20:26:18.9308849Z 2025-05-07T20:26:18.9308853Z 2025-05-07T20:26:18.9308856Z 2025-05-07T20:26:18.9308860Z 2025-05-07T20:26:18.9308864Z 2025-05-07T20:26:18.9308868Z 2025-05-07T20:26:18.9308871Z 2025-05-07T20:26:18.9308900Z 2025-05-07T20:26:18.9308915Z 2025-05-07T20:26:18.9308918Z 2025-05-07T20:26:18.9308922Z 2025-05-07T20:26:18.9308925Z 2025-05-07T20:26:18.9308935Z 2025-05-07T20:26:18.9308939Z 2025-05-07T20:26:18.9309778Z 2025-05-07T20:26:18.9431340Z cuda-sanitizer-api-1 | 8.9 MB | | 0%  2025-05-07T20:26:18.9431735Z 2025-05-07T20:26:18.9431741Z 2025-05-07T20:26:18.9431746Z 2025-05-07T20:26:18.9431751Z 2025-05-07T20:26:18.9431756Z 2025-05-07T20:26:18.9431761Z 2025-05-07T20:26:18.9431766Z 2025-05-07T20:26:18.9431772Z 2025-05-07T20:26:18.9431777Z 2025-05-07T20:26:18.9431782Z 2025-05-07T20:26:18.9431787Z 
2025-05-07T20:26:18.9431792Z 2025-05-07T20:26:18.9431797Z 2025-05-07T20:26:18.9431802Z 2025-05-07T20:26:18.9434641Z 2025-05-07T20:26:18.9960739Z cuda-nvvm-tools-12.6 | 10.4 MB | ### | 31%  2025-05-07T20:26:18.9961083Z 2025-05-07T20:26:18.9961087Z 2025-05-07T20:26:18.9961091Z 2025-05-07T20:26:18.9961384Z 2025-05-07T20:26:18.9961389Z 2025-05-07T20:26:18.9961393Z 2025-05-07T20:26:18.9961396Z 2025-05-07T20:26:18.9961400Z 2025-05-07T20:26:18.9961404Z 2025-05-07T20:26:18.9961407Z 2025-05-07T20:26:18.9961411Z 2025-05-07T20:26:18.9962756Z 2025-05-07T20:26:19.0167629Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%  2025-05-07T20:26:19.0167947Z 2025-05-07T20:26:19.0167951Z 2025-05-07T20:26:19.0167955Z 2025-05-07T20:26:19.0167967Z 2025-05-07T20:26:19.0167971Z 2025-05-07T20:26:19.0167975Z 2025-05-07T20:26:19.0167978Z 2025-05-07T20:26:19.0167982Z 2025-05-07T20:26:19.0167985Z 2025-05-07T20:26:19.0167989Z 2025-05-07T20:26:19.0167993Z 2025-05-07T20:26:19.0167996Z 2025-05-07T20:26:19.0168000Z 2025-05-07T20:26:19.0168004Z 2025-05-07T20:26:19.0310011Z cuda-nvcc-dev_linux- | 10.8 MB | #####2 | 52%  2025-05-07T20:26:19.0310348Z 2025-05-07T20:26:19.0310352Z 2025-05-07T20:26:19.0310355Z 2025-05-07T20:26:19.0310384Z 2025-05-07T20:26:19.0310388Z 2025-05-07T20:26:19.0310391Z 2025-05-07T20:26:19.0310395Z 2025-05-07T20:26:19.0310399Z 2025-05-07T20:26:19.0310402Z 2025-05-07T20:26:19.0310406Z 2025-05-07T20:26:19.0310409Z 2025-05-07T20:26:19.0310413Z 2025-05-07T20:26:19.0310417Z 2025-05-07T20:26:19.0310420Z 2025-05-07T20:26:19.0310424Z 2025-05-07T20:26:19.0310435Z 2025-05-07T20:26:19.0583285Z cuda-sanitizer-api-1 | 8.9 MB | ##9 | 29%  2025-05-07T20:26:19.0583631Z 2025-05-07T20:26:19.0583635Z 2025-05-07T20:26:19.0583639Z 2025-05-07T20:26:19.0583643Z 2025-05-07T20:26:19.0583646Z 2025-05-07T20:26:19.0583650Z 2025-05-07T20:26:19.0583654Z 2025-05-07T20:26:19.0583658Z 2025-05-07T20:26:19.0583661Z 2025-05-07T20:26:19.0583665Z 2025-05-07T20:26:19.0583669Z 2025-05-07T20:26:19.0583673Z 2025-05-07T20:26:19.0583676Z 2025-05-07T20:26:19.0583686Z 2025-05-07T20:26:19.0583690Z 2025-05-07T20:26:19.0657323Z cuda-nvvm-tools-12.6 | 10.4 MB | ###### | 61%  2025-05-07T20:26:19.0657902Z 2025-05-07T20:26:19.0657907Z 2025-05-07T20:26:19.0657918Z 2025-05-07T20:26:19.0657922Z 2025-05-07T20:26:19.0657925Z 2025-05-07T20:26:19.0657929Z 2025-05-07T20:26:19.0657932Z 2025-05-07T20:26:19.0657936Z 2025-05-07T20:26:19.0657940Z 2025-05-07T20:26:19.0657943Z 2025-05-07T20:26:19.0657947Z 2025-05-07T20:26:19.0657951Z 2025-05-07T20:26:19.0657954Z 2025-05-07T20:26:19.0657958Z 2025-05-07T20:26:19.0657962Z 2025-05-07T20:26:19.0657965Z 2025-05-07T20:26:19.0659114Z 2025-05-07T20:26:19.1316204Z cuda-nvvm-impl-12.6. 
| 7.7 MB | | 0%  2025-05-07T20:26:19.1316553Z 2025-05-07T20:26:19.1316558Z 2025-05-07T20:26:19.1316563Z 2025-05-07T20:26:19.1316568Z 2025-05-07T20:26:19.1316589Z 2025-05-07T20:26:19.1316595Z 2025-05-07T20:26:19.1316601Z 2025-05-07T20:26:19.1316605Z 2025-05-07T20:26:19.1316609Z 2025-05-07T20:26:19.1316614Z 2025-05-07T20:26:19.1316646Z 2025-05-07T20:26:19.1316662Z 2025-05-07T20:26:19.1316666Z 2025-05-07T20:26:19.1316670Z 2025-05-07T20:26:19.1316674Z 2025-05-07T20:26:19.1316678Z 2025-05-07T20:26:19.1334681Z cuda-sanitizer-api-1 | 8.9 MB | #####9 | 59%  2025-05-07T20:26:19.1335030Z 2025-05-07T20:26:19.1335034Z 2025-05-07T20:26:19.1335038Z 2025-05-07T20:26:19.1335041Z 2025-05-07T20:26:19.1335046Z 2025-05-07T20:26:19.1335052Z 2025-05-07T20:26:19.1335057Z 2025-05-07T20:26:19.1335061Z 2025-05-07T20:26:19.1335065Z 2025-05-07T20:26:19.1335068Z 2025-05-07T20:26:19.1335072Z 2025-05-07T20:26:19.1335076Z 2025-05-07T20:26:19.1335079Z 2025-05-07T20:26:19.1335083Z 2025-05-07T20:26:19.1657819Z cuda-nvcc-dev_linux- | 10.8 MB | #######6 | 77%  2025-05-07T20:26:19.1658267Z 2025-05-07T20:26:19.1658271Z 2025-05-07T20:26:19.1658275Z 2025-05-07T20:26:19.1658279Z 2025-05-07T20:26:19.1658282Z 2025-05-07T20:26:19.1658516Z 2025-05-07T20:26:19.1658533Z 2025-05-07T20:26:19.1658536Z 2025-05-07T20:26:19.1658540Z 2025-05-07T20:26:19.1658543Z 2025-05-07T20:26:19.1658547Z 2025-05-07T20:26:19.1658551Z 2025-05-07T20:26:19.1658554Z 2025-05-07T20:26:19.1658558Z 2025-05-07T20:26:19.1658561Z 2025-05-07T20:26:19.1658565Z 2025-05-07T20:26:19.1663577Z 2025-05-07T20:26:19.1922384Z cuda-nvvm-impl-12.6. | 7.7 MB | ###4 | 35%  2025-05-07T20:26:19.1922784Z 2025-05-07T20:26:19.1922788Z 2025-05-07T20:26:19.1922792Z 2025-05-07T20:26:19.1922795Z 2025-05-07T20:26:19.1922799Z 2025-05-07T20:26:19.1922803Z 2025-05-07T20:26:19.1922806Z 2025-05-07T20:26:19.1922810Z 2025-05-07T20:26:19.1922814Z 2025-05-07T20:26:19.1922817Z 2025-05-07T20:26:19.1922821Z 2025-05-07T20:26:19.1922825Z 2025-05-07T20:26:19.1922828Z 2025-05-07T20:26:19.1922832Z 2025-05-07T20:26:19.1922836Z 2025-05-07T20:26:19.2341696Z cuda-nvvm-tools-12.6 | 10.4 MB | ########9 | 89%  2025-05-07T20:26:19.2342032Z 2025-05-07T20:26:19.2342036Z 2025-05-07T20:26:19.2342039Z 2025-05-07T20:26:19.2342043Z 2025-05-07T20:26:19.2342055Z 2025-05-07T20:26:19.2342059Z 2025-05-07T20:26:19.2342063Z 2025-05-07T20:26:19.2342066Z 2025-05-07T20:26:19.2342070Z 2025-05-07T20:26:19.2342074Z 2025-05-07T20:26:19.2342077Z 2025-05-07T20:26:19.2342081Z 2025-05-07T20:26:19.2342085Z 2025-05-07T20:26:19.2342088Z 2025-05-07T20:26:19.2342092Z 2025-05-07T20:26:19.2344136Z 2025-05-07T20:26:19.2374998Z cuda-sanitizer-api-1 | 8.9 MB | ########8 | 89%  2025-05-07T20:26:19.2375345Z 2025-05-07T20:26:19.2375349Z 2025-05-07T20:26:19.2375352Z 2025-05-07T20:26:19.2375356Z 2025-05-07T20:26:19.2375360Z 2025-05-07T20:26:19.2375363Z 2025-05-07T20:26:19.2375367Z 2025-05-07T20:26:19.2375370Z 2025-05-07T20:26:19.2375374Z 2025-05-07T20:26:19.2375378Z 2025-05-07T20:26:19.2375381Z 2025-05-07T20:26:19.2375385Z 2025-05-07T20:26:19.2375397Z 2025-05-07T20:26:19.2375613Z 2025-05-07T20:26:19.2699362Z cuda-nvcc-dev_linux- | 10.8 MB | #########9 | 100%  2025-05-07T20:26:19.2699807Z 2025-05-07T20:26:19.2699813Z 2025-05-07T20:26:19.2699818Z 2025-05-07T20:26:19.2699833Z 2025-05-07T20:26:19.2699838Z 2025-05-07T20:26:19.2699843Z 2025-05-07T20:26:19.2699848Z 2025-05-07T20:26:19.2699853Z 2025-05-07T20:26:19.2699859Z 2025-05-07T20:26:19.2699864Z 2025-05-07T20:26:19.2699869Z 2025-05-07T20:26:19.2699874Z 2025-05-07T20:26:19.2699879Z 2025-05-07T20:26:19.2699885Z 
2025-05-07T20:26:19.2699889Z 2025-05-07T20:26:19.2699892Z 2025-05-07T20:26:19.2699896Z 2025-05-07T20:26:19.5920279Z cuda-nvvm-impl-12.6. | 7.7 MB | ######9 | 69%  2025-05-07T20:26:19.5920635Z 2025-05-07T20:26:19.5920639Z 2025-05-07T20:26:19.5920644Z 2025-05-07T20:26:19.5920647Z 2025-05-07T20:26:19.5920652Z 2025-05-07T20:26:19.5920657Z 2025-05-07T20:26:19.5920661Z 2025-05-07T20:26:19.5920686Z 2025-05-07T20:26:19.5920704Z 2025-05-07T20:26:19.5920707Z 2025-05-07T20:26:19.5920711Z 2025-05-07T20:26:19.5920715Z 2025-05-07T20:26:19.5920718Z 2025-05-07T20:26:19.5920722Z 2025-05-07T20:26:19.5920726Z 2025-05-07T20:26:19.5921396Z 2025-05-07T20:26:19.6118362Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%  2025-05-07T20:26:19.6118693Z 2025-05-07T20:26:19.6118697Z 2025-05-07T20:26:19.6118701Z 2025-05-07T20:26:19.6118705Z 2025-05-07T20:26:19.6118708Z 2025-05-07T20:26:19.6118712Z 2025-05-07T20:26:19.6118716Z 2025-05-07T20:26:19.6118719Z 2025-05-07T20:26:19.6118723Z 2025-05-07T20:26:19.6118727Z 2025-05-07T20:26:19.6118730Z 2025-05-07T20:26:19.6118741Z 2025-05-07T20:26:19.6118745Z 2025-05-07T20:26:19.6118749Z 2025-05-07T20:26:19.6120275Z 2025-05-07T20:26:19.6230029Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%  2025-05-07T20:26:19.6230358Z 2025-05-07T20:26:19.6230572Z 2025-05-07T20:26:19.6230586Z 2025-05-07T20:26:19.6230589Z 2025-05-07T20:26:19.6230593Z 2025-05-07T20:26:19.6230597Z 2025-05-07T20:26:19.6230600Z 2025-05-07T20:26:19.6230604Z 2025-05-07T20:26:19.6230608Z 2025-05-07T20:26:19.6230611Z 2025-05-07T20:26:19.6230615Z 2025-05-07T20:26:19.6230619Z 2025-05-07T20:26:19.6230622Z 2025-05-07T20:26:19.6230626Z 2025-05-07T20:26:19.6230629Z 2025-05-07T20:26:19.6230633Z 2025-05-07T20:26:19.6230637Z 2025-05-07T20:26:19.6301450Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%  2025-05-07T20:26:19.6301858Z 2025-05-07T20:26:19.6301863Z 2025-05-07T20:26:19.6301866Z 2025-05-07T20:26:19.6301870Z 2025-05-07T20:26:19.6301874Z 2025-05-07T20:26:19.6301877Z 2025-05-07T20:26:19.6301881Z 2025-05-07T20:26:19.6301884Z 2025-05-07T20:26:19.6301895Z 2025-05-07T20:26:19.6301898Z 2025-05-07T20:26:19.6301902Z 2025-05-07T20:26:19.6301906Z 2025-05-07T20:26:19.6301909Z 2025-05-07T20:26:19.6301913Z 2025-05-07T20:26:19.6301926Z 2025-05-07T20:26:19.6301934Z 2025-05-07T20:26:19.6301938Z 2025-05-07T20:26:19.6302484Z 2025-05-07T20:26:19.6452211Z libglib-2.84.0 | 3.8 MB | | 0%  2025-05-07T20:26:19.6452739Z 2025-05-07T20:26:19.6452750Z 2025-05-07T20:26:19.6452760Z 2025-05-07T20:26:19.6452770Z 2025-05-07T20:26:19.6452779Z 2025-05-07T20:26:19.6452788Z 2025-05-07T20:26:19.6452796Z 2025-05-07T20:26:19.6452805Z 2025-05-07T20:26:19.6452811Z 2025-05-07T20:26:19.6452819Z 2025-05-07T20:26:19.6452826Z 2025-05-07T20:26:19.6452833Z 2025-05-07T20:26:19.6452838Z 2025-05-07T20:26:19.6454086Z 2025-05-07T20:26:19.6658502Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%  2025-05-07T20:26:19.6658891Z 2025-05-07T20:26:19.6658898Z 2025-05-07T20:26:19.6658903Z 2025-05-07T20:26:19.6658909Z 2025-05-07T20:26:19.6658914Z 2025-05-07T20:26:19.6658919Z 2025-05-07T20:26:19.6658925Z 2025-05-07T20:26:19.6658930Z 2025-05-07T20:26:19.6659560Z 2025-05-07T20:26:19.6659581Z 2025-05-07T20:26:19.6659587Z 2025-05-07T20:26:19.6659592Z 2025-05-07T20:26:19.6659598Z 2025-05-07T20:26:19.6659603Z 2025-05-07T20:26:19.6659607Z 2025-05-07T20:26:19.6659612Z 2025-05-07T20:26:19.6659617Z 2025-05-07T20:26:19.6659622Z 2025-05-07T20:26:19.6661108Z 2025-05-07T20:26:19.7305108Z ... (more hidden) ... 
2025-05-07T20:26:19.7305592Z 2025-05-07T20:26:19.7305597Z 2025-05-07T20:26:19.7305601Z 2025-05-07T20:26:19.7305605Z 2025-05-07T20:26:19.7305608Z 2025-05-07T20:26:19.7305613Z 2025-05-07T20:26:19.7305618Z 2025-05-07T20:26:19.7305621Z 2025-05-07T20:26:19.7305625Z 2025-05-07T20:26:19.7305629Z 2025-05-07T20:26:19.7305633Z 2025-05-07T20:26:19.7305636Z 2025-05-07T20:26:19.7305640Z 2025-05-07T20:26:19.7305644Z 2025-05-07T20:26:19.7305648Z 2025-05-07T20:26:19.7305651Z 2025-05-07T20:26:19.7305655Z 2025-05-07T20:26:19.7320363Z 2025-05-07T20:26:19.7659177Z libglib-2.84.0 | 3.8 MB | ########5 | 86%  2025-05-07T20:26:19.7659787Z 2025-05-07T20:26:19.7659792Z 2025-05-07T20:26:19.7659795Z 2025-05-07T20:26:19.7659799Z 2025-05-07T20:26:19.7659803Z 2025-05-07T20:26:19.7659806Z 2025-05-07T20:26:19.7659810Z 2025-05-07T20:26:19.7659814Z 2025-05-07T20:26:19.7659817Z 2025-05-07T20:26:19.7659821Z 2025-05-07T20:26:19.7659824Z 2025-05-07T20:26:19.7659828Z 2025-05-07T20:26:19.7659832Z 2025-05-07T20:26:19.7659835Z 2025-05-07T20:26:19.7659839Z 2025-05-07T20:26:19.7659843Z 2025-05-07T20:26:19.7659846Z 2025-05-07T20:26:19.7659850Z 2025-05-07T20:26:19.7659853Z 2025-05-07T20:26:19.8753423Z ... (more hidden) ... 2025-05-07T20:26:19.8753741Z 2025-05-07T20:26:19.8753746Z 2025-05-07T20:26:19.8753750Z 2025-05-07T20:26:19.8753754Z 2025-05-07T20:26:19.8753758Z 2025-05-07T20:26:19.8753763Z 2025-05-07T20:26:19.8753768Z 2025-05-07T20:26:19.8753773Z 2025-05-07T20:26:19.8754095Z 2025-05-07T20:26:19.8754118Z 2025-05-07T20:26:19.8754123Z 2025-05-07T20:26:19.8754128Z 2025-05-07T20:26:19.8754134Z 2025-05-07T20:26:19.8754139Z 2025-05-07T20:26:19.8754143Z 2025-05-07T20:26:19.8754149Z 2025-05-07T20:26:19.8754154Z 2025-05-07T20:26:19.8758653Z 2025-05-07T20:26:19.9036718Z libglib-2.84.0 | 3.8 MB | ########## | 100%  2025-05-07T20:26:19.9037062Z 2025-05-07T20:26:19.9037066Z 2025-05-07T20:26:19.9037070Z 2025-05-07T20:26:19.9037074Z 2025-05-07T20:26:19.9037077Z 2025-05-07T20:26:19.9037081Z 2025-05-07T20:26:19.9037085Z 2025-05-07T20:26:19.9037089Z 2025-05-07T20:26:19.9037092Z 2025-05-07T20:26:19.9037096Z 2025-05-07T20:26:19.9037100Z 2025-05-07T20:26:19.9037103Z 2025-05-07T20:26:19.9037107Z 2025-05-07T20:26:19.9037111Z 2025-05-07T20:26:19.9037115Z 2025-05-07T20:26:19.9037118Z 2025-05-07T20:26:19.9037128Z 2025-05-07T20:26:19.9037132Z 2025-05-07T20:26:19.9037136Z 2025-05-07T20:26:21.1363441Z ... (more hidden) ... 
2025-05-07T20:26:21.9644673Z libcusolver-11.7.1.2 | 95.8 MB  | ########## | 100%
2025-05-07T20:26:22.4272614Z cuda-nvvp-12.6.80    | 109.3 MB | ########## | 100%
2025-05-07T20:26:22.5731973Z libnpp-12.3.1.54     | 93.4 MB  | ########## | 100%
2025-05-07T20:26:22.8076770Z libcurand-10.3.7.77  | 39.9 MB  | ########## | 100%
2025-05-07T20:26:23.0564877Z cuda-nvdisasm-12.6.7 | 47.6 MB  | ########## | 100%
2025-05-07T20:26:23.1923818Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:23.2386162Z gds-tools-1.11.1.6   | 37.8 MB  | ########## | 100%
2025-05-07T20:26:23.3981844Z cuda-nvcc-tools-12.6 | 23.0 MB  | ########## | 100%
2025-05-07T20:26:23.4990881Z libnvjitlink-12.6.85 | 14.9 MB  | ########## | 100%
2025-05-07T20:26:23.5854504Z cuda-nvrtc-12.6.85   | 17.3 MB  | ########## | 100%
2025-05-07T20:26:23.6638869Z cuda-sanitizer-api-1 | 8.9 MB   | ########## | 100%
2025-05-07T20:26:23.7070309Z cuda-nvvm-tools-12.6 | 10.4 MB  | ########## | 100%
2025-05-07T20:26:23.9060949Z cuda-nvvm-impl-12.6. | 7.7 MB   | ########## | 100%
2025-05-07T20:26:23.9472493Z libglib-2.84.0       | 3.8 MB   | ########## | 100%
2025-05-07T20:26:24.1298731Z cuda-nvcc-dev_linux- | 10.8 MB  | ########## | 100%
2025-05-07T20:26:25.7375591Z ... (more hidden) ...
2025-05-07T20:26:30.1984766Z libcublas-12.6.4.1   | 256.2 MB | ########## | 100%
2025-05-07T20:26:30.1992307Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:30.2060368Z done
2025-05-07T20:26:30.4095251Z Preparing transaction: done
2025-05-07T20:26:31.5188387Z Verifying transaction: done
2025-05-07T20:26:32.1257232Z Executing transaction: done
2025-05-07T20:26:34.2974674Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:34.2975091Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:34.2975773Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:34.2989227Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
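[NOTE] A minimal sketch, not part of the original job, for sanity-checking the symlinks created above; the paths assume the same build_binary env prefix as the log:
  # confirm each unversioned .so is a symlink that resolves to the versioned library
  for lib in \
      /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so \
      /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so; do
    test -L "$lib" && readlink -f "$lib"
  done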
2025-05-07T20:26:34.3003115Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:34.3008044Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:34.4654369Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:34.4679803Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:34.5052952Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:36.3928665Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:36.4560400Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:36.8780893Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:36.9130407Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
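[NOTE] The `printenv` failure above is likely expected on the first pass: printenv exits non-zero when the variable is unset, and `conda env config vars set` only takes effect on the next activation of the env. A minimal sketch of the pattern used here (env name build_binary as in the log):
  # persist a variable inside the conda env itself
  conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
  # inspect everything persisted for the env
  conda env config vars list -n build_binary
  # a fresh `conda run` re-activates the env and picks the value up
  conda run -n build_binary printenv LD_LIBRARY_PATH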
2025-05-07T20:26:37.3450120Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:37.3451024Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:39.7859802Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:41.8020353Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:43.8203979Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:43.8205104Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:45.8428408Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:47.7295654Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:47.7942021Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:51.6422949Z /tmp/tmp4nff3e7b: line 3: clang: command not found
2025-05-07T20:26:51.6423769Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:51.7067495Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:51.7088351Z total 36
2025-05-07T20:26:51.7088703Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:26:51.7089103Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:25 ..
2025-05-07T20:26:51.7089563Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:26:51.7090058Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:26:51.7090532Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:26:51.7090985Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:51.7091414Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:51.7091856Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:51.7092351Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:26:51.7092972Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:26:51.7115542Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:26:53.6637989Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:26:53.6638534Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:26:54.0897117Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:26:55.9770967Z -allow-unsupported-compiler
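[NOTE] NVCC_PREPEND_FLAGS is read by nvcc itself (CUDA 11.5+); its contents are prepended to every nvcc command line, so the setting above is equivalent to passing the flag explicitly. A sketch, with hello.cu as a hypothetical input file:
  # both invocations compile with the unsupported-host-compiler check relaxed
  NVCC_PREPEND_FLAGS='-allow-unsupported-compiler' nvcc -c hello.cu -o hello.o
  nvcc -allow-unsupported-compiler -c hello.cu -o hello.o
2025-05-07T20:26:56.0420825Z [INFO] Printing out all preprocessor defines in nvcc ...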
2025-05-07T20:26:56.0421311Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:56.0421633Z 2025-05-07T20:26:58.0047709Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:58.0048346Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:58.0048706Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:58.0049041Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:58.0049379Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:58.0049662Z #define _STL_PAIR_H 1 2025-05-07T20:26:58.0049918Z #define __cpp_attributes 200809L 2025-05-07T20:26:58.0050263Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:58.0050657Z #define __DELETE_THROW throw() 2025-05-07T20:26:58.0050927Z #define _PTRDIFF_T_ 2025-05-07T20:26:58.0051186Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:58.0051475Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:58.0051744Z #define _IO_LEFT 02 2025-05-07T20:26:58.0051966Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:58.0052224Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:58.0052497Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:58.0052908Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:58.0053430Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:58.0055581Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:58.0055837Z #define _IOS_OUTPUT 2 2025-05-07T20:26:58.0056170Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:58.0056536Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:58.0056840Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:58.0057110Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:58.0057385Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:58.0058149Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:58.0058938Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:58.0059530Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:58.0059838Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:58.0060151Z #define _T_WCHAR_ 2025-05-07T20:26:58.0060370Z #define stdout stdout 2025-05-07T20:26:58.0060707Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:58.0061082Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:58.0061336Z #define __flexarr [] 2025-05-07T20:26:58.0061577Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:58.0061901Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:58.0062239Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:58.0062492Z #define _MATH_H 1 2025-05-07T20:26:58.0062768Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:58.0063101Z #define __S64_TYPE long int 2025-05-07T20:26:58.0063355Z #define __stub_fchflags 2025-05-07T20:26:58.0063621Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:58.0063909Z #define __SQUAD_TYPE long int 2025-05-07T20:26:58.0064178Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:58.0064441Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:58.0064703Z #define NL_NMAX INT_MAX 2025-05-07T20:26:58.0065099Z #define _BITS_TIME_H 1 2025-05-07T20:26:58.0065385Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:58.0065720Z #define 
_GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:58.0066058Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:58.0066407Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:58.0066803Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:58.0067156Z #define __CHAR_BIT__ 8 2025-05-07T20:26:58.0067422Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0067732Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:58.0068022Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:58.0068287Z #define FP_NAN 0 2025-05-07T20:26:58.0068548Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:58.0068981Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:58.0069461Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:58.0069850Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:58.0070139Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:58.0070396Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:58.0070650Z #define __SM_80_RT_H__ 2025-05-07T20:26:58.0070875Z #define _NEW 2025-05-07T20:26:58.0071096Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:58.0071374Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:58.0071738Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:58.0072129Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:58.0072372Z #define __USE_ANSI 1 2025-05-07T20:26:58.0072660Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:58.0073049Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:58.0073397Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:58.0073693Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:58.0073972Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:58.0074248Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:58.0074534Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:58.0074945Z #define PIPE_BUF 4096 2025-05-07T20:26:58.0075259Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:58.0075618Z #define ADJ_TICK 0x4000 2025-05-07T20:26:58.0075929Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:58.0076261Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:58.0076529Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:58.0076848Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:58.0077294Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:58.0077815Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:58.0078177Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:58.0078439Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:58.0078710Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0078996Z #define __cpp_static_assert 201411L 2025-05-07T20:26:58.0079337Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:58.0079683Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:58.0079968Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:58.0080246Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:58.0080542Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:58.0080827Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:58.0081129Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0081485Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:58.0081837Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:58.0082120Z #define 
_GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:58.0082429Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0082787Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:58.0083139Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:58.0083428Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:58.0083726Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:58.0084137Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:58.0084466Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:58.0094930Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:58.0095412Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:58.0095763Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:58.0096056Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:58.0096347Z #define __GCC_IEC_559 2 2025-05-07T20:26:58.0096644Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:58.0096978Z #define _IO_flockfile(_fp) 2025-05-07T20:26:58.0097248Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:58.0097523Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:58.0097783Z #define _IOFBF 0 2025-05-07T20:26:58.0098012Z #define __USE_BSD 1 2025-05-07T20:26:58.0098246Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:58.0098527Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:58.0098794Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:58.0099055Z #define _IO_NO_WRITES 8 2025-05-07T20:26:58.0099320Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:58.0099672Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:58.0100014Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:58.0100320Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:58.0100641Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:58.0100928Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:58.0101191Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:58.0101457Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:58.0101764Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:58.0102139Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:58.0102500Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:58.0102805Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:58.0103105Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:58.0103432Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:58.0103859Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:58.0104149Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:58.0104424Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:58.0104691Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:58.0105271Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? 
EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:58.0105876Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:58.0106223Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:58.0106539Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:58.0106832Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:58.0107111Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:58.0107380Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:58.0107673Z #define __ASSERT_VOID_CAST static_cast <void> 2025-05-07T20:26:58.0108005Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:58.0108313Z #define RAND_MAX 2147483647 2025-05-07T20:26:58.0108598Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:58.0108918Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0109236Z #define __SM_90_RT_H__ 2025-05-07T20:26:58.0109483Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:58.0109742Z #define __COMPAR_FN_T 2025-05-07T20:26:58.0109989Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0110255Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:58.0110724Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:58.0111230Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.0111575Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.0111927Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:58.0112226Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.0112563Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:58.0112874Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:58.0113463Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.0114005Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:58.0114337Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:58.0114607Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:58.0114906Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:58.0115208Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:58.0115470Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:58.0115738Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:58.0116030Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:58.0116301Z #define __u_char_defined 2025-05-07T20:26:58.0116620Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:58.0116977Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:58.0117235Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:58.0117491Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:58.0117774Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:58.0118205Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:58.0118621Z #define FP_INFINITE 1 2025-05-07T20:26:58.0118986Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.0119395Z #define _IO_pid_t __pid_t 2025-05-07T20:26:58.0119644Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:58.0119900Z #define __LEAF , __leaf__ 2025-05-07T20:26:58.0120140Z #define PATH_MAX 4096 2025-05-07T20:26:58.0120387Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:58.0120716Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:58.0121029Z #define _LIMITS_H___ 2025-05-07T20:26:58.0121256Z #define __size_t 2025-05-07T20:26:58.0121556Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:58.0122134Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR |
STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:58.0122696Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:58.0123005Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:58.0123429Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:58.0123695Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:58.0124048Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:58.0124444Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:58.0124744Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:58.0125343Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:58.0125718Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:58.0126010Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:58.0126294Z #define __INT8_C(c) c 2025-05-07T20:26:58.0126552Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:58.0126854Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:58.0127121Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:58.0127377Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:58.0127632Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:58.0127915Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:58.0128238Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0128564Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:58.0128843Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:58.0129112Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:58.0129378Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:58.0129693Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:58.0129998Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:58.0130350Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:58.0130727Z #define NFDBITS __NFDBITS 2025-05-07T20:26:58.0130989Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:58.0131275Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:58.0131604Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:58.0131930Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:58.0132188Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:58.0132597Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:58.0132913Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:58.0133328Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:58.0133748Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:58.0134116Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:58.0134414Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:58.0134729Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:58.0135101Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:58.0135444Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:58.0135760Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:58.0136150Z #define __daddr_t_defined 2025-05-07T20:26:58.0136411Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.0136685Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:58.0137006Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:58.0137523Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:58.0138012Z #define _ACRTIMP 2025-05-07T20:26:58.0138236Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:26:58.0138510Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:58.0138807Z #define _IOS_BIN 128 2025-05-07T20:26:58.0139160Z #define __fortify_function __extern_always_inline 
__attribute_artificial__ 2025-05-07T20:26:58.0139580Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0139860Z #define UNDERFLOW 4 2025-05-07T20:26:58.0140086Z #define NAME_MAX 255 2025-05-07T20:26:58.0140335Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:58.0140614Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:58.0140896Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:58.0141199Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:58.0141576Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:58.0141958Z #define __ptr_t void * 2025-05-07T20:26:58.0142206Z #define M_E 2.7182818284590452354 2025-05-07T20:26:58.0142588Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:58.0142863Z #define __USE_ISOCXX11 1 2025-05-07T20:26:58.0143132Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:58.0143454Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:58.0143758Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:58.0144034Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:58.0144328Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:58.0144647Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:58.0144906Z #define __linux 1 2025-05-07T20:26:58.0145142Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:58.0145423Z #define cudaDeviceMask 0xff 2025-05-07T20:26:58.0145693Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:58.0146017Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:58.0146328Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:58.0146618Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:58.0146934Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:58.0147250Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:58.0147554Z #define _BITS_TYPES_H 1 2025-05-07T20:26:58.0147841Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:58.0148184Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:58.0148496Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:58.0148775Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:58.0149075Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:58.0149375Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:58.0150148Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:58.0150951Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:58.0151238Z #define __unix 1 2025-05-07T20:26:58.0151467Z #define MATH_ERRNO 1 2025-05-07T20:26:58.0151721Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:58.0152006Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:58.0152377Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:58.0152682Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:58.0152978Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0153271Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:58.0153737Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:58.0154197Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:58.0154500Z #define CUDARTAPI_CDECL 2025-05-07T20:26:58.0154766Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:58.0155047Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:26:58.0155337Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:58.0155606Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:58.0155851Z #define __SIZE_T 2025-05-07T20:26:58.0156104Z #define isgraph_l(c,l) 
__isgraph_l ((c), (l)) 2025-05-07T20:26:58.0156426Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:58.0156729Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:58.0156998Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:58.0157275Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:58.0157666Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:58.0158090Z #define __WAIT_STATUS void * 2025-05-07T20:26:58.0158363Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:58.0158643Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:58.0158912Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:58.0159388Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:58.0159744Z #define __WINT_MIN__ 0U 2025-05-07T20:26:58.0160314Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:58.0160943Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:58.0161241Z #define WUNTRACED 2 2025-05-07T20:26:58.0161472Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:58.0161744Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:58.0162033Z #define NZERO 20 2025-05-07T20:26:58.0162460Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:58.0162728Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:58.0163015Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:58.0163300Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:58.0163545Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.0163826Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:58.0164098Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:58.0164370Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:58.0164631Z #define EXIT_FAILURE 1 2025-05-07T20:26:58.0164868Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:58.0165128Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:58.0165385Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:58.0165636Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:58.0165934Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:58.0166285Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:58.0166640Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:58.0166942Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:58.0167184Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:58.0167457Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:58.0167750Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:58.0168045Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:58.0168330Z #define SEEK_DATA 3 2025-05-07T20:26:58.0168560Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:58.0168851Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:58.0169256Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:58.0169636Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:58.0169880Z #define __INT64_C(c) c ## L 2025-05-07T20:26:58.0170141Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:58.0170467Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:58.0170785Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:58.0171052Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:58.0171470Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:58.0171774Z #define STA_PPSWANDER 0x0400 2025-05-07T20:26:58.0172017Z #define __INT_WCHAR_T_H 2025-05-07T20:26:58.0172248Z #define WSTOPPED 2 2025-05-07T20:26:58.0172481Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:58.0172753Z #define 
_POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:58.0172998Z #define FP_NORMAL 4 2025-05-07T20:26:58.0173296Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:58.0173570Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:58.0173794Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:58.0174043Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:58.0174312Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:58.0174573Z #define cudaTextureType1D 0x01 2025-05-07T20:26:58.0174834Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:58.0175085Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:58.0175345Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:58.0175630Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:58.0176095Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:58.0176539Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:58.0176795Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:58.0177043Z #define _POSIX_SOURCE 1 2025-05-07T20:26:58.0177278Z #define cudaTextureType2D 0x02 2025-05-07T20:26:58.0177530Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:58.0177791Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:58.0178097Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:58.0178356Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:58.0178666Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:58.0178991Z #define cudaTextureType3D 0x03 2025-05-07T20:26:58.0179248Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:58.0179501Z #define CLOCK_REALTIME 0 2025-05-07T20:26:58.0179751Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:58.0180013Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:58.0180309Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:58.0180582Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:58.0180937Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:58.0181217Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:58.0181479Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:58.0181772Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:58.0182057Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:58.0182330Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:58.0182574Z #define __GLIBC__ 2 2025-05-07T20:26:58.0182778Z #define __END_DECLS } 2025-05-07T20:26:58.0183013Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:58.0183366Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:58.0183724Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:58.0183973Z #define WCONTINUED 8 2025-05-07T20:26:58.0184196Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:58.0184440Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:58.0184704Z #define _ALLOCA_H 1 2025-05-07T20:26:58.0184940Z #define __host__ __location__(host) 2025-05-07T20:26:58.0185351Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:58.0185774Z #define __SLONG32_TYPE int 2025-05-07T20:26:58.0186034Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:58.0186304Z #define _SYS_SELECT_H 1 2025-05-07T20:26:58.0186535Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:58.0186777Z #define _IOS_NOCREATE 32 2025-05-07T20:26:58.0187021Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:58.0187287Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:26:58.0187570Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:58.0187845Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:58.0188117Z #define __global__ __location__(global) 2025-05-07T20:26:58.0188395Z #define 
__GNU_LIBRARY__ 6 2025-05-07T20:26:58.0188643Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:58.0188903Z #define __DBL_DIG__ 15 2025-05-07T20:26:58.0189130Z #define TIME_UTC 1 2025-05-07T20:26:58.0189427Z #define __FLT32_DIG__ 6 2025-05-07T20:26:58.0189742Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:58.0190127Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:58.0190550Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:58.0190871Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:58.0191158Z #define _G_BUFSIZ 8192 2025-05-07T20:26:58.0191457Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:58.0191819Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:58.0192134Z #define __cudaCDP2GetDevice 2025-05-07T20:26:58.0192463Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:58.0192744Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:58.0192981Z #define __GXX_WEAK__ 1 2025-05-07T20:26:58.0193230Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0193526Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:58.0193775Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:58.0194064Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:58.0194397Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:58.0194668Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:58.0194947Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:58.0195233Z #define _G_config_h 1 2025-05-07T20:26:58.0195502Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:58.0195845Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:58.0196149Z #define _GCC_WCHAR_T 2025-05-07T20:26:58.0196373Z #define TMP_MAX 238328 2025-05-07T20:26:58.0196600Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:58.0196862Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:58.0197119Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0197386Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:58.0197655Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:58.0197928Z #define _IO_SKIPWS 01 2025-05-07T20:26:58.0198314Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:58.0198755Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:58.0199140Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:58.0199458Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:58.0199816Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:58.0200175Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:58.0200530Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.0200777Z #define le32toh(x) (x) 2025-05-07T20:26:58.0201009Z #define _SIZE_T_DEFINED 2025-05-07T20:26:58.0201258Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:58.0201587Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:58.0201932Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:58.0202325Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:58.0202722Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:58.0202988Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:58.0203249Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:58.0203511Z #define _POSIX_NAME_MAX 14 2025-05-07T20:26:58.0203796Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:26:58.0204303Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:26:58.0204794Z #define 
_GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:26:58.0205096Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:26:58.0205439Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:26:58.0205751Z #define _WCHAR_T_ 2025-05-07T20:26:58.0205974Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:26:58.0206387Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:26:58.0206770Z #define RTSIG_MAX 32 2025-05-07T20:26:58.0206991Z #define _STDDEF_H 2025-05-07T20:26:58.0207221Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:26:58.0207490Z #define _VA_LIST_DEFINED 2025-05-07T20:26:58.0207741Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:26:58.0208066Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:26:58.0208536Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:26:58.0208869Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:26:58.0209153Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:26:58.0209608Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:26:58.0210128Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:26:58.0210489Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:26:58.0210808Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:26:58.0211118Z #define __unix__ 1 2025-05-07T20:26:58.0211355Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0211641Z #define __INT_WIDTH__ 32 2025-05-07T20:26:58.0211892Z #define __SIZEOF_LONG__ 8 2025-05-07T20:26:58.0212136Z #define _IONBF 2 2025-05-07T20:26:58.0212571Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:26:58.0213415Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:26:58.0261417Z #define __STDC_IEC_559__ 1 2025-05-07T20:26:58.0261711Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:26:58.0261979Z #define __UINT16_C(c) c 2025-05-07T20:26:58.0262232Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:26:58.0262566Z #define STA_DEL 0x0020 2025-05-07T20:26:58.0262858Z #define __CUDACC_VER_MINOR__ 6 2025-05-07T20:26:58.0263114Z #define __id_t_defined 2025-05-07T20:26:58.0263379Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:26:58.0263822Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:26:58.0264242Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:26:58.0264500Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:26:58.0264752Z #define __DECIMAL_DIG__ 21 2025-05-07T20:26:58.0264995Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:26:58.0265245Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:26:58.0265517Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:26:58.0266041Z #define SING 2 2025-05-07T20:26:58.0266249Z #define STA_FREQHOLD 0x0080 2025-05-07T20:26:58.0266509Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0266798Z #define cudaStreamDefault 0x00 2025-05-07T20:26:58.0267133Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:26:58.0267499Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:26:58.0267770Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:26:58.0268026Z #define __gnu_linux__ 1 2025-05-07T20:26:58.0268265Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:26:58.0268580Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:26:58.0268822Z #define MAX_INPUT 255 2025-05-07T20:26:58.0269058Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:26:58.0269381Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:26:58.0269748Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:26:58.0270059Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:26:58.0270327Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:26:58.0270719Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:26:58.0271136Z #define _IO_SHOWPOS 02000 2025-05-07T20:26:58.0271462Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:26:58.0271814Z #define _Mfloat_ float 2025-05-07T20:26:58.0272071Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:26:58.0272377Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.0272661Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:26:58.0273135Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:26:58.0273648Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0273934Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:26:58.0274266Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:26:58.0274623Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:26:58.0275061Z #define __USE_ISOC11 1 2025-05-07T20:26:58.0275305Z #define _BSD_SIZE_T_ 2025-05-07T20:26:58.0275542Z #define ADJ_MICRO 0x1000 2025-05-07T20:26:58.0275797Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:26:58.0276056Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:26:58.0276358Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:26:58.0276674Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:26:58.0276987Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:26:58.0277317Z #define __THROW throw () 2025-05-07T20:26:58.0277568Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:26:58.0277858Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0278213Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.0278559Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:26:58.0278832Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:26:58.0279097Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:26:58.0279358Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:26:58.0279627Z #define L_tmpnam 20 2025-05-07T20:26:58.0279857Z #define ___int_wchar_t_h 2025-05-07T20:26:58.0280196Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:26:58.0280573Z #define isascii(c) __isascii (c) 2025-05-07T20:26:58.0280833Z #define _T_PTRDIFF 2025-05-07T20:26:58.0281143Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:26:58.0281493Z #define toascii(c) __toascii (c) 2025-05-07T20:26:58.0281751Z #define __GNUC__ 11 2025-05-07T20:26:58.0282006Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:26:58.0282298Z #define __GXX_RTTI 1 2025-05-07T20:26:58.0282523Z #define __pie__ 2 2025-05-07T20:26:58.0282736Z #define __MMX__ 1 2025-05-07T20:26:58.0282953Z #define __cudaCDP2Malloc 2025-05-07T20:26:58.0283209Z #define __timespec_defined 1 2025-05-07T20:26:58.0283471Z #define L_ctermid 9 2025-05-07T20:26:58.0283703Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:58.0284009Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:26:58.0284404Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:26:58.0284853Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:26:58.0285125Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:26:58.0285421Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:26:58.0285729Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:26:58.0286041Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:26:58.0286308Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:26:58.0286749Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:26:58.0287484Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:58.0288085Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:26:58.0288390Z #define __USE_SVID 1 2025-05-07T20:26:58.0288645Z #define __constant__ __location__(constant) 2025-05-07T20:26:58.0288960Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:26:58.0289265Z #define __device__ __location__(device) 2025-05-07T20:26:58.0289597Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:26:58.0289913Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:26:58.0290181Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:26:58.0290461Z #define CUDART_DEVICE __device__ 2025-05-07T20:26:58.0290801Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:26:58.0291164Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:26:58.0291456Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:26:58.0291818Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:26:58.0292195Z #define __STDC_UTF_16__ 1 2025-05-07T20:26:58.0292448Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:26:58.0292831Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:26:58.0293414Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:26:58.0293724Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:26:58.0294087Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:26:58.0294371Z #define NGROUPS_MAX 65536 2025-05-07T20:26:58.0294643Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:26:58.0294925Z #define __USE_ISOC95 1 2025-05-07T20:26:58.0295165Z #define _TIME_H 1 2025-05-07T20:26:58.0295512Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:26:58.0295846Z #define __USE_ISOC99 1 2025-05-07T20:26:58.0296199Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:26:58.0296553Z #define HOST_NAME_MAX 64 2025-05-07T20:26:58.0296798Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:26:58.0297053Z #define _IOS_ATEND 4 2025-05-07T20:26:58.0297282Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:26:58.0297598Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:26:58.0297996Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:58.0298327Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:26:58.0298601Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:26:58.0298919Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:26:58.0299224Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:26:58.0299473Z #define _STDIO_H 1 2025-05-07T20:26:58.0299872Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:26:58.0300324Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:26:58.0300677Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:58.0301047Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:26:58.0301324Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:26:58.0301586Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:26:58.0301850Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:26:58.0302136Z #define __cpp_raw_strings 200710L 2025-05-07T20:26:58.0302426Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0302734Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:26:58.0302997Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:26:58.0303267Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:26:58.0303568Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:26:58.0303938Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:26:58.0304213Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:26:58.0304569Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:26:58.0304930Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:26:58.0305161Z #define __USE_XOPEN 1 2025-05-07T20:26:58.0305405Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:26:58.0305838Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:58.0306319Z #define __USE_XOPEN2K 1 2025-05-07T20:26:58.0306553Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:26:58.0306823Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:26:58.0307110Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:26:58.0307371Z #define __cpp_fold_expressions 201603L 2025-05-07T20:26:58.0307881Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:26:58.0308397Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:26:58.0308674Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:26:58.0309027Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:26:58.0309406Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0309773Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:26:58.0310161Z #define __END_NAMESPACE_C99 2025-05-07T20:26:58.0310427Z #define __glibcxx_integral_traps true 2025-05-07T20:26:58.0310706Z #define _POSIX_PATH_MAX 256 2025-05-07T20:26:58.0310949Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:26:58.0311202Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:26:58.0311463Z #define _ISOC11_SOURCE 1 2025-05-07T20:26:58.0311704Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:26:58.0311989Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:26:58.0312282Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:26:58.0312637Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:26:58.0313092Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:26:58.0313368Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:26:58.0313617Z #define _IO_UNITBUF 020000 2025-05-07T20:26:58.0313862Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:26:58.0314126Z #define __FD_SETSIZE 1024 2025-05-07T20:26:58.0314367Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:26:58.0314638Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:26:58.0314977Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:26:58.0315326Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:26:58.0315579Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:26:58.0315887Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:26:58.0316200Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:26:58.0316461Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:26:58.0316770Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:26:58.0317098Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:26:58.0317374Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:26:58.0317700Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:26:58.0317987Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:26:58.0318248Z #define __USE_POSIX199506 1 2025-05-07T20:26:58.0318494Z #define _FEATURES_H 1 2025-05-07T20:26:58.0318727Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:26:58.0319115Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:26:58.0319515Z #define __stub_getmsg 2025-05-07T20:26:58.0319742Z #define _IO_FIXED 010000 2025-05-07T20:26:58.0320017Z #define __cpp_lib_addressof_constexpr 201603 
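The __glibcxx_digits_b, __glibcxx_max_b, and __glibcxx_digits10_b helpers dumped above are the bit arithmetic behind libstdc++'s numeric_limits. A minimal C sketch of the same formulas, assuming __glibcxx_signed_b(T,B) expands to ((T)(-1) < 0) — its definition does not appear in this dump:

    #include <stdio.h>

    #define signed_b(T)      ((T)(-1) < 0)                    /* assumed definition */
    #define digits_b(T, B)   ((B) - signed_b(T))              /* value bits */
    #define max_b(T, B)      (signed_b(T) \
        ? (((((T)1 << (digits_b(T, B) - 1)) - 1) << 1) + 1)   /* builds 2^(B-1)-1 without overflow */ \
        : ~(T)0)
    #define digits10_b(T, B) (digits_b(T, B) * 643L / 2136)   /* 643/2136 ~= log10(2) */

    int main(void) {
        /* For a 32-bit signed int: 31 value bits, max 2147483647, 9 decimal digits. */
        printf("digits=%d max=%d digits10=%ld\n",
               digits_b(int, 32), max_b(int, 32), digits10_b(int, 32));
        return 0;
    }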
2025-05-07T20:26:58.0320318Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:26:58.0320581Z #define __stub_setlogin 2025-05-07T20:26:58.0320817Z #define __stub_fattach 2025-05-07T20:26:58.0321046Z #define __cplusplus 201703L 2025-05-07T20:26:58.0321307Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:26:58.0321582Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:26:58.0321828Z #define INFINITY (__builtin_inff()) 2025-05-07T20:26:58.0322104Z #define _IO_UNBUFFERED 2 2025-05-07T20:26:58.0322666Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:26:58.0323174Z #define _IO_INTERNAL 010 2025-05-07T20:26:58.0323414Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:26:58.0323741Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:26:58.0324093Z #define __dev_t_defined 2025-05-07T20:26:58.0324324Z #define __DEPRECATED 1 2025-05-07T20:26:58.0324549Z #define __S32_TYPE int 2025-05-07T20:26:58.0324796Z #define __cpp_rvalue_references 200610L 2025-05-07T20:26:58.0325081Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:26:58.0325332Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:26:58.0325581Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:26:58.0326169Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:26:58.0326796Z #define _G_HAVE_MREMAP 1 2025-05-07T20:26:58.0327106Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:58.0327437Z #define OVERFLOW 3 2025-05-07T20:26:58.0327690Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:26:58.0328086Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:26:58.0328374Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0328699Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:26:58.0329022Z #define __SSE2_MATH__ 1 2025-05-07T20:26:58.0329261Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:26:58.0329557Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0329855Z #define _IO_STDIO_H 2025-05-07T20:26:58.0330102Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:26:58.0330412Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:26:58.0330759Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:26:58.0331082Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0331418Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:26:58.0331707Z #define __amd64 1 2025-05-07T20:26:58.0331944Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:26:58.0332343Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:26:58.0332628Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:26:58.0332913Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:26:58.0333297Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:26:58.0333552Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:26:58.0333844Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:26:58.0334105Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:26:58.0334346Z #define __bounded 2025-05-07T20:26:58.0334575Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0334857Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:26:58.0335128Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:26:58.0335387Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:26:58.0335654Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0335985Z #define __W_STOPCODE(sig) ((sig) << 8 | 0x7f) 2025-05-07T20:26:58.0336412Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:58.0336813Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:26:58.0337086Z 
#define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:26:58.0337414Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:26:58.0337746Z #define STA_PLL 0x0001 2025-05-07T20:26:58.0337986Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:26:58.0338242Z #define __GNUG__ 11 2025-05-07T20:26:58.0338466Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:26:58.0338723Z #define _T_WCHAR 2025-05-07T20:26:58.0338950Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:26:58.0339245Z #define __specialization_static 2025-05-07T20:26:58.0339540Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:26:58.0339838Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:26:58.0340091Z #define cudaArraySparse 0x40 2025-05-07T20:26:58.0340353Z #define STA_PPSFREQ 0x0002 2025-05-07T20:26:58.0340592Z #define __GLIBCXX__ 20230528 2025-05-07T20:26:58.0340867Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:26:58.0341163Z #define _WCHAR_T 2025-05-07T20:26:58.0341375Z #define __cudaCDP2Free 2025-05-07T20:26:58.0342095Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:26:58.0342761Z #define __cpp_nsdmi 200809L 2025-05-07T20:26:58.0343167Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:26:58.0343591Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:26:58.0343860Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:26:58.0344116Z #define cudaArrayCubemap 0x04 2025-05-07T20:26:58.0344439Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:58.0344787Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:26:58.0345025Z #define __NO_CTYPE 1 2025-05-07T20:26:58.0345246Z #define __stub_bdflush 2025-05-07T20:26:58.0345593Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:26:58.0346038Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:26:58.0346369Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:26:58.0346629Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:26:58.0346896Z #define __cpp_initializer_lists 200806L 2025-05-07T20:26:58.0347191Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:26:58.0347477Z #define __U16_TYPE unsigned short int 2025-05-07T20:26:58.0347808Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:26:58.0348142Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:26:58.0348416Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:26:58.0348694Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:26:58.0349027Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:26:58.0349366Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:26:58.0349635Z #define _IO_STDIO 040000 2025-05-07T20:26:58.0349952Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:26:58.0350325Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:26:58.0350716Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:26:58.0351012Z #define _PTRDIFF_T 2025-05-07T20:26:58.0351229Z #define _MOVE_H 1 2025-05-07T20:26:58.0351448Z #define __cpp_hex_float 201603L 2025-05-07T20:26:58.0351704Z #define ADJ_TAI 0x0080 2025-05-07T20:26:58.0351929Z #define __ptrvalue 2025-05-07T20:26:58.0352145Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:26:58.0352393Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:26:58.0352672Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:26:58.0352964Z #define 
MATH_ERREXCEPT 2 2025-05-07T20:26:58.0353218Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:26:58.0353503Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:26:58.0353892Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:26:58.0354265Z #define __USE_GNU 1 2025-05-07T20:26:58.0354494Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:26:58.0354766Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:26:58.0355031Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:26:58.0355428Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:26:58.0355839Z #define WEXITED 4 2025-05-07T20:26:58.0356072Z #define _IO_NO_READS 4 2025-05-07T20:26:58.0356374Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:26:58.0356722Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:26:58.0356995Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:26:58.0357287Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:26:58.0357598Z #define __uid_t_defined 2025-05-07T20:26:58.0357848Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:26:58.0358140Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:26:58.0358417Z #define WNOHANG 1 2025-05-07T20:26:58.0358664Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:26:58.0358969Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:26:58.0359457Z #define cudaEventDefault 0x00 2025-05-07T20:26:58.0359868Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:26:58.0360210Z #define NL_SETMAX INT_MAX 2025-05-07T20:26:58.0360589Z #define __x86_64 1 2025-05-07T20:26:58.0360819Z #define __cudaCDP2LaunchDevice 2025-05-07T20:26:58.0361203Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.0361671Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:26:58.0362164Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:58.0362596Z #define __PTRDIFF_T 2025-05-07T20:26:58.0362911Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:26:58.0363281Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:26:58.0363548Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0363826Z #define _Mlong_double_ long double 2025-05-07T20:26:58.0364106Z #define __cpp_lambdas 200907L 2025-05-07T20:26:58.0364353Z #define _IO_DEC 020 2025-05-07T20:26:58.0364567Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:26:58.0364838Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:26:58.0365135Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:26:58.0365403Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:26:58.0365665Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:26:58.0365962Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:26:58.0366277Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:26:58.0366543Z #define _ANSI_STDDEF_H 2025-05-07T20:26:58.0366807Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:26:58.0367114Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:26:58.0367475Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:26:58.0367847Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:26:58.0368123Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:26:58.0368403Z #define __cpp_template_auto 201606L 2025-05-07T20:26:58.0368755Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:26:58.0369118Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:26:58.0369562Z #define 
__key_t_defined 2025-05-07T20:26:58.0369821Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:26:58.0370192Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:26:58.0370649Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:26:58.0371006Z #define __GNUC_VA_LIST 2025-05-07T20:26:58.0371336Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:58.0371709Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:26:58.0371967Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:26:58.0372249Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:26:58.0372535Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:26:58.0372779Z #define __WCOREFLAG 0x80 2025-05-07T20:26:58.0373110Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:26:58.0373413Z #define cudaEventDisableTiming 0x02 2025-05-07T20:26:58.0373682Z #define __LP64__ 1 2025-05-07T20:26:58.0373921Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:26:58.0374241Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:26:58.0374519Z #define _IO_off64_t __off64_t 2025-05-07T20:26:58.0374777Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0375031Z #define __time_t_defined 1 2025-05-07T20:26:58.0375280Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:26:58.0382906Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:26:58.0383330Z #define __USE_UNIX98 1 2025-05-07T20:26:58.0383578Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0383850Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:26:58.0384118Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:26:58.0384415Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:26:58.0384723Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:26:58.0384972Z #define SEEK_CUR 1 2025-05-07T20:26:58.0385204Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0385472Z #define _ASSERT_H 1 2025-05-07T20:26:58.0386055Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:26:58.0386808Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:26:58.0387079Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:26:58.0387325Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:26:58.0387581Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:26:58.0387848Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:26:58.0388211Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:26:58.0388605Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:26:58.0389249Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:26:58.0389885Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:26:58.0390184Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:26:58.0390525Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:26:58.0390891Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:26:58.0391163Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.0391435Z #define cudaArrayDefault 0x00 2025-05-07T20:26:58.0391714Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:26:58.0391996Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:26:58.0392264Z #define TLOSS 5 2025-05-07T20:26:58.0392476Z #define __ssize_t_defined 2025-05-07T20:26:58.0392724Z #define __CUDACC_VER_BUILD__ 85 2025-05-07T20:26:58.0392991Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:26:58.0393272Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:26:58.0393558Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:26:58.0393911Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:26:58.0394287Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:26:58.0394564Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:26:58.0394842Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:26:58.0395144Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:26:58.0395540Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:26:58.0395828Z #define __REGISTER_PREFIX__ 2025-05-07T20:26:58.0396171Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:26:58.0396551Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:26:58.0396899Z #define _IOS_NOREPLACE 64 2025-05-07T20:26:58.0397128Z #define __cdecl 2025-05-07T20:26:58.0397360Z #define cudaEventInterprocess 0x04 2025-05-07T20:26:58.0397678Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:26:58.0398081Z #define LOGIN_NAME_MAX 256 2025-05-07T20:26:58.0398328Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:26:58.0398583Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:26:58.0398872Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:26:58.0399128Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:26:58.0399424Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:26:58.0399750Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:26:58.0400156Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:58.0400576Z #define ADJ_NANO 0x2000 2025-05-07T20:26:58.0400876Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:26:58.0401223Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:26:58.0401500Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:26:58.0401750Z #define __FLT_DIG__ 6 2025-05-07T20:26:58.0402094Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:26:58.0402482Z #define __NO_INLINE__ 1 2025-05-07T20:26:58.0402774Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:26:58.0403115Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:26:58.0403369Z #define ADJ_STATUS 0x0010 2025-05-07T20:26:58.0403627Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:26:58.0403912Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:26:58.0404173Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:58.0404461Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:26:58.0404749Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:26:58.0405227Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 
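__NV_GLIBCXX_VERSION just above, and __CUDART_API_VERSION earlier in the dump, pack (major, minor, patch) into a single comparable integer. A sketch of the arithmetic, assuming GCC 11.4.0 (per __VERSION__ "11.4.0" below) and a CUDA runtime minor of 6 (per __CUDACC_VER_MINOR__; the __CUDA_API_VER_MINOR__ value itself is not shown here):

    #include <stdio.h>

    int main(void) {
        /* __NV_GLIBCXX_VERSION = __GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__ */
        int nv_glibcxx = 11 * 10000 + 4 * 100 + 0;   /* -> 110400 */
        /* __CUDART_API_VERSION = major * 1000 + minor * 10 */
        int cudart_api = 12 * 1000 + 6 * 10;         /* -> 12060 */
        printf("%d %d\n", nv_glibcxx, cudart_api);
        return 0;
    }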
2025-05-07T20:26:58.0405636Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:26:58.0405972Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:26:58.0406310Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:26:58.0406547Z #define MAX_CANON 255 2025-05-07T20:26:58.0406772Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:26:58.0407019Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:26:58.0407282Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:26:58.0407556Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:26:58.0407859Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:26:58.0408150Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:26:58.0408420Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:26:58.0408735Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:26:58.0409039Z #define __VERSION__ "11.4.0" 2025-05-07T20:26:58.0409295Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:26:58.0409595Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:26:58.0409880Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:26:58.0410155Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:26:58.0410455Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:26:58.0410742Z #define __UINT64_C(c) c ## UL 2025-05-07T20:26:58.0410994Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:26:58.0411237Z #define _SYS_TYPES_H 1 2025-05-07T20:26:58.0411471Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:26:58.0411728Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:26:58.0411970Z #define _SYS_CDEFS_H 1 2025-05-07T20:26:58.0412201Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:26:58.0412469Z #define __cpp_unicode_characters 201411L 2025-05-07T20:26:58.0412753Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:26:58.0413003Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:26:58.0413366Z #define __cudaCDP2StreamDestroy 2025-05-07T20:26:58.0413628Z #define FP_SUBNORMAL 3 2025-05-07T20:26:58.0413961Z #define cudaOccupancyDefault 0x00 2025-05-07T20:26:58.0414246Z #define _INITIALIZER_LIST 2025-05-07T20:26:58.0414489Z #define _STDC_PREDEF_H 1 2025-05-07T20:26:58.0414734Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:26:58.0415005Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:26:58.0415287Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:26:58.0415534Z #define _IO_file_flags _flags 2025-05-07T20:26:58.0415797Z #define __USE_XOPEN2K8 1 2025-05-07T20:26:58.0416086Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:26:58.0416357Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:26:58.0416624Z #define HUGE 3.40282347e+38F 2025-05-07T20:26:58.0416883Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:26:58.0417249Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:26:58.0417636Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:26:58.0417942Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:26:58.0418201Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:26:58.0418455Z #define _BSD_SOURCE 1 2025-05-07T20:26:58.0418696Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:26:58.0419526Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:26:58.0420346Z #define __catch(X) catch(X) 2025-05-07T20:26:58.0420606Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:26:58.0420897Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:26:58.0421161Z #define __TIMER_T_TYPE void * 2025-05-07T20:26:58.0421412Z #define __STRING(x) #x 2025-05-07T20:26:58.0421655Z #define
__GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:26:58.0421921Z #define _T_PTRDIFF_ 2025-05-07T20:26:58.0422164Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:26:58.0422463Z #define cudaEventWaitExternal 0x01 2025-05-07T20:26:58.0422734Z #define __unbounded 2025-05-07T20:26:58.0422973Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.0423267Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:26:58.0423629Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0423920Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:26:58.0424197Z #define __cpp_lib_is_final 201402L 2025-05-07T20:26:58.0424493Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:26:58.0424807Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:26:58.0425108Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:26:58.0425387Z #define __managed__ __location__(managed) 2025-05-07T20:26:58.0425673Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:26:58.0426111Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:58.0426517Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:26:58.0426772Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:26:58.0427135Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:26:58.0427525Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:26:58.0427777Z #define _SYS_SIZE_T_H 2025-05-07T20:26:58.0428065Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:26:58.0428403Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:26:58.0428679Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:26:58.0428967Z #define _CRTIMP 2025-05-07T20:26:58.0429188Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:26:58.0429489Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:58.0429804Z #define STA_PPSJITTER 0x0200 2025-05-07T20:26:58.0430152Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:26:58.0430549Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0430859Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:26:58.0431132Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:26:58.0431411Z #define __SIZE_T__ 2025-05-07T20:26:58.0431624Z #define __stub_gtty 2025-05-07T20:26:58.0431847Z #define __pid_t_defined 2025-05-07T20:26:58.0432100Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:26:58.0432488Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0432802Z #define __glibcxx_function_requires(...) 
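__GNUC_PREREQ just above (and __GLIBC_PREREQ earlier in the dump) compare versions by packing the major number into the high bits, (major << 16) + minor, so a single integer compare orders (major, minor) pairs. A self-contained C sketch of that comparison:

    #include <stdio.h>

    /* Same shape as ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)). */
    static int prereq(int cur_maj, int cur_min, int maj, int min) {
        return (cur_maj << 16) + cur_min >= (maj << 16) + min;
    }

    int main(void) {
        printf("%d\n", prereq(11, 4, 4, 8));   /* 11.4 >= 4.8  -> 1 */
        printf("%d\n", prereq(11, 4, 12, 0));  /* 11.4 >= 12.0 -> 0 */
        return 0;
    }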
2025-05-07T20:26:58.0433092Z #define __SM_80_RT_HPP__ 2025-05-07T20:26:58.0433331Z #define __need_clockid_t 2025-05-07T20:26:58.0433603Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:26:58.0433862Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:26:58.0434171Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:26:58.0434480Z #define _IO_HEX 0100 2025-05-07T20:26:58.0434733Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:26:58.0435062Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:26:58.0435361Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:26:58.0435625Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:26:58.0436023Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.0436449Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:26:58.0436753Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:26:58.0437045Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:26:58.0437336Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:26:58.0437615Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:26:58.0437869Z #define __stub_sstk 2025-05-07T20:26:58.0438096Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:26:58.0438400Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:26:58.0438723Z #define __wur 2025-05-07T20:26:58.0438959Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:26:58.0439249Z #define _G_HAVE_MMAP 1 2025-05-07T20:26:58.0439471Z #define _IO_OCT 040 2025-05-07T20:26:58.0439691Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:26:58.0439949Z #define NL_MSGMAX INT_MAX 2025-05-07T20:26:58.0440192Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:26:58.0440470Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:26:58.0440775Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:26:58.0441029Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:26:58.0441388Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:26:58.0441765Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:26:58.0442101Z #define _STL_ALGOBASE_H 1 2025-05-07T20:26:58.0442356Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:26:58.0442633Z #define __off64_t_defined 2025-05-07T20:26:58.0442885Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:26:58.0443142Z #define __FLT128_DIG__ 33 2025-05-07T20:26:58.0443392Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:26:58.0443678Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:26:58.0443930Z #define __INT32_C(c) c 2025-05-07T20:26:58.0444161Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:26:58.0444426Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:26:58.0444692Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:26:58.0444955Z #define __PDP_ENDIAN 3412 2025-05-07T20:26:58.0445193Z #define _ISOC95_SOURCE 1 2025-05-07T20:26:58.0445436Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:26:58.0445730Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:26:58.0446051Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:26:58.0446304Z #define __SM_90_RT_HPP__ 2025-05-07T20:26:58.0446545Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:26:58.0446820Z #define __have_pthread_attr_t 1 2025-05-07T20:26:58.0447082Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:26:58.0447470Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:26:58.0447894Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:26:58.0448183Z #define __cudaCDP2EventRecord 2025-05-07T20:26:58.0448441Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:26:58.0448679Z #define 
htole32(x) (x) 2025-05-07T20:26:58.0449066Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:26:58.0449523Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:26:58.0449824Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:26:58.0450160Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:26:58.0450539Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:26:58.0450967Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:26:58.0451324Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:26:58.0451637Z #define ADJ_OFFSET 0x0001 2025-05-07T20:26:58.0451884Z #define cudaArrayLayered 0x01 2025-05-07T20:26:58.0452210Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:26:58.0452576Z #define cudaEventRecordDefault 0x00 2025-05-07T20:26:58.0452859Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:26:58.0453201Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:26:58.0453471Z #define unix 1 2025-05-07T20:26:58.0453683Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:26:58.0453928Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:26:58.0454179Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:26:58.0454454Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:26:58.0454744Z #define __USE_POSIX 1 2025-05-07T20:26:58.0454978Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:26:58.0455268Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:26:58.0455579Z #define __THROWNL throw () 2025-05-07T20:26:58.0455824Z #define __cpp_rtti 199711L 2025-05-07T20:26:58.0456075Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:26:58.0456352Z #define __PMT(args) args 2025-05-07T20:26:58.0456606Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0456950Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:26:58.0457294Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:26:58.0457588Z #define _SIZE_T_DECLARED 2025-05-07T20:26:58.0457833Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:26:58.0458086Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:26:58.0458623Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:26:58.0459453Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:26:58.0459747Z #define XATTR_LIST_MAX 65536 2025-05-07T20:26:58.0459999Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:26:58.0460309Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:26:58.0460784Z #define _WCHAR_T_H 2025-05-07T20:26:58.0460999Z #define __FLT64X_DIG__ 18 2025-05-07T20:26:58.0461233Z #define _IO_SHOWBASE 0200 2025-05-07T20:26:58.0461465Z #define _POSIX_QLIMIT 1 2025-05-07T20:26:58.0461705Z #define __INT8_TYPE__ signed char 2025-05-07T20:26:58.0461965Z #define __SURFACE_TYPES_H__ 2025-05-07T20:26:58.0462214Z #define __CUDA_ARCH__ 520 2025-05-07T20:26:58.0462466Z #define __cpp_digit_separators 201309L 2025-05-07T20:26:58.0462735Z #define __ELF__ 1 2025-05-07T20:26:58.0462955Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:26:58.0463227Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:26:58.0463481Z #define STA_INS 0x0010 2025-05-07T20:26:58.0463715Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:26:58.0464063Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:26:58.0464408Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:26:58.0464654Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.0464924Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:26:58.0465235Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0465523Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:26:58.0465800Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:26:58.0466086Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:26:58.0466407Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:26:58.0466799Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:26:58.0467135Z #define _IO_funlockfile(_fp) 2025-05-07T20:26:58.0467609Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:58.0468151Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:26:58.0468462Z #define __DRIVER_TYPES_H__ 2025-05-07T20:26:58.0468699Z #define __FLT_RADIX__ 2 2025-05-07T20:26:58.0468944Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:26:58.0469294Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:26:58.0469783Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:26:58.0470043Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:26:58.0470306Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:26:58.0470587Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:26:58.0470842Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:26:58.0471112Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:26:58.0471378Z #define WORD_BIT 32 2025-05-07T20:26:58.0471589Z #define _IO_USER_BUF 1 2025-05-07T20:26:58.0471817Z #define __VECTOR_TYPES_H__ 2025-05-07T20:26:58.0472070Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0472359Z #define cudaHostAllocPortable 0x01 2025-05-07T20:26:58.0472639Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:26:58.0472903Z #define __long_double_t long double 2025-05-07T20:26:58.0473171Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:26:58.0473425Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:26:58.0473978Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:26:58.0474551Z #define __k8 1 2025-05-07T20:26:58.0474857Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:26:58.0475300Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:26:58.0475669Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:26:58.0475965Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:26:58.0476242Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:26:58.0476519Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:26:58.0476786Z #define __blksize_t_defined 2025-05-07T20:26:58.0477037Z #define _IO_SHOWPOINT 0400 2025-05-07T20:26:58.0477285Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:26:58.0477569Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:26:58.0477857Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:26:58.0478124Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:26:58.0478409Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:26:58.0478661Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:26:58.0479081Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:26:58.0479843Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:26:58.0480365Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:26:58.0480641Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:26:58.0480887Z #define SEEK_SET 0 2025-05-07T20:26:58.0481108Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:26:58.0481373Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:26:58.0481725Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:26:58.0482099Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:26:58.0482377Z #define __cudaCDP2GetLastError 2025-05-07T20:26:58.0482638Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:26:58.0482884Z #define _MATH_H_MATHDEF 1 2025-05-07T20:26:58.0483343Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:26:58.0483840Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:26:58.0484101Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:26:58.0484349Z #define __stub_sigreturn 2025-05-07T20:26:58.0484727Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:26:58.0485148Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:26:58.0485406Z #define __HOST_CONFIG_H__ 2025-05-07T20:26:58.0485648Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:26:58.0485901Z #define CLOCK_TAI 11 2025-05-07T20:26:58.0486135Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:26:58.0486408Z #define __restrict_arr 2025-05-07T20:26:58.0486658Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:26:58.0486991Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:26:58.0487806Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:26:58.0488577Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:26:58.0488930Z #define __USE_MISC 1 2025-05-07T20:26:58.0489168Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:26:58.0489450Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:26:58.0489697Z #define _GCC_LIMITS_H_ 2025-05-07T20:26:58.0489923Z #define __LDBL_DIG__ 18 2025-05-07T20:26:58.0490151Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:26:58.0490415Z #define __malloc_and_calloc_defined 2025-05-07T20:26:58.0490687Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:26:58.0490948Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:26:58.0491210Z #define __x86_64__ 1 2025-05-07T20:26:58.0491419Z #define _SIZE_T_ 2025-05-07T20:26:58.0492406Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:26:58.0493537Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:26:58.0493819Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:26:58.0494106Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:26:58.0494420Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:26:58.0494716Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:26:58.0494985Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:26:58.0495294Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:26:58.0495637Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:26:58.0495961Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:26:58.0496578Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:26:58.0497248Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:26:58.0497689Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:26:58.0498022Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:26:58.0498284Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:26:58.0498577Z #define STA_FLL 0x0008 2025-05-07T20:26:58.0498930Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:26:58.0499254Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:26:58.0499539Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0499857Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:26:58.0500138Z #define __stub_revoke 2025-05-07T20:26:58.0500365Z #define __timer_t_defined 1 2025-05-07T20:26:58.0500646Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:26:58.0501121Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:26:58.0501536Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:26:58.0506392Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:26:58.0506706Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:26:58.0507012Z #define cudaArrayTextureGather 0x08 2025-05-07T20:26:58.0507300Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:26:58.0507620Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:26:58.0507945Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:26:58.0508197Z #define _IO_off_t __off_t 2025-05-07T20:26:58.0508435Z #define __FLT64_DIG__ 15 2025-05-07T20:26:58.0508803Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:26:58.0509205Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:26:58.0509492Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0509835Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:26:58.0510133Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:26:58.0510397Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:26:58.0510658Z #define NULL __null 2025-05-07T20:26:58.0510912Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:26:58.0511351Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:26:58.0511639Z #define __U64_TYPE unsigned long int 2025-05-07T20:26:58.0511906Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0512164Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:26:58.0512404Z #define FP_ZERO 2 2025-05-07T20:26:58.0512621Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:26:58.0512931Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:26:58.0513278Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0513556Z #define __WCHAR_T__ 2025-05-07T20:26:58.0513775Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:26:58.0514127Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:26:58.0514571Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:26:58.0514907Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:26:58.0515189Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:26:58.0515506Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.0515835Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:26:58.0516177Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:26:58.0516481Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:26:58.0516721Z #define _SIGSET_H_types 1 2025-05-07T20:26:58.0516983Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:26:58.0517279Z #define __cpp_unicode_literals 200710L 2025-05-07T20:26:58.0517608Z #define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 
2025-05-07T20:26:58.0517943Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:26:58.0518244Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:26:58.0518581Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:26:58.0518903Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:26:58.0519220Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:26:58.0519604Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:26:58.0519962Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:26:58.0520230Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:26:58.0520599Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:26:58.0520864Z #define STA_MODE 0x4000 2025-05-07T20:26:58.0521112Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:26:58.0521404Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:26:58.0521695Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:26:58.0521999Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:26:58.0522272Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:26:58.0522649Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:26:58.0522952Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:26:58.0523223Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:26:58.0523509Z #define __SIZE_WIDTH__ 64 2025-05-07T20:26:58.0523776Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.0524058Z #define __SEG_FS 1 2025-05-07T20:26:58.0524275Z #define _IO_size_t size_t 2025-05-07T20:26:58.0524522Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:26:58.0524795Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:26:58.0525052Z #define __stub_lchmod 2025-05-07T20:26:58.0525293Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:26:58.0525564Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0525851Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:26:58.0526107Z #define __SEG_GS 1 2025-05-07T20:26:58.0526412Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:26:58.0526769Z #define _IOS_APPEND 8 2025-05-07T20:26:58.0527001Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:26:58.0527257Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:26:58.0527506Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:26:58.0527778Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:26:58.0528043Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:26:58.0528305Z #define htole16(x) (x) 2025-05-07T20:26:58.0528549Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:58.0528836Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:26:58.0529089Z #define __INT16_TYPE__ short int 2025-05-07T20:26:58.0529448Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:26:58.0529756Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:26:58.0530055Z #define __cpp_structured_bindings 201606L 2025-05-07T20:26:58.0530368Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:26:58.0530662Z #define __SIZEOF_INT__ 4 2025-05-07T20:26:58.0530898Z #define __WCLONE 0x80000000 2025-05-07T20:26:58.0531140Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:26:58.0531380Z #define SEEK_HOLE 4 2025-05-07T20:26:58.0531591Z #define TIMER_ABSTIME 1 2025-05-07T20:26:58.0531820Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:26:58.0532068Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:26:58.0532395Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:58.0532767Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0533155Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:26:58.0533430Z #define 
__cpp_sized_deallocation 201309L 2025-05-07T20:26:58.0533723Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0534014Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:26:58.0534319Z #define _LINUX_LIMITS_H 2025-05-07T20:26:58.0534547Z #define linux 1 2025-05-07T20:26:58.0534755Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:26:58.0535020Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:26:58.0535314Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:26:58.0535572Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:26:58.0535838Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:26:58.0536169Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:26:58.0536490Z #define __cpp_lib_hypot 201603 2025-05-07T20:26:58.0536749Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0537018Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:26:58.0537272Z #define MOD_NANO ADJ_NANO 2025-05-07T20:26:58.0537503Z #define htole64(x) (x) 2025-05-07T20:26:58.0537737Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:26:58.0538041Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:26:58.0538355Z #define _IO_UPPERCASE 01000 2025-05-07T20:26:58.0539090Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:26:58.0539751Z #define __USE_POSIX2 1 2025-05-07T20:26:58.0539976Z #define INT_MAX __INT_MAX__ 2025-05-07T20:26:58.0540225Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:26:58.0540485Z #define __WALL 0x40000000 2025-05-07T20:26:58.0540722Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:26:58.0540967Z #define _XLOCALE_H 1 2025-05-07T20:26:58.0541202Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:26:58.0541457Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:26:58.0541722Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0541986Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:26:58.0542263Z #define __EXCEPTIONS 1 2025-05-07T20:26:58.0542504Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:26:58.0542868Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:26:58.0543238Z #define __WORDSIZE 64 2025-05-07T20:26:58.0543467Z #define CLOCK_MONOTONIC 1 2025-05-07T20:26:58.0543700Z #define _STL_RELOPS_H 1 2025-05-07T20:26:58.0543933Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:26:58.0544185Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:26:58.0544454Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:26:58.0544725Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:26:58.0544976Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:26:58.0545434Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:26:58.0546048Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:26:58.0546492Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:26:58.0546789Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:26:58.0547061Z #define __cpp_range_based_for 201603L 2025-05-07T20:26:58.0547361Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:26:58.0547654Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:26:58.0548020Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:26:58.0548408Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:26:58.0548770Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:26:58.0549030Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:26:58.0549291Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:26:58.0549650Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:26:58.0550023Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:26:58.0550302Z #define _STRING_H 1 2025-05-07T20:26:58.0550523Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:26:58.0550779Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:26:58.0551033Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:26:58.0551338Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:26:58.0551651Z #define __code_model_small__ 1 2025-05-07T20:26:58.0551899Z #define _PSTL_CONFIG_H 2025-05-07T20:26:58.0552140Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:26:58.0552439Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:26:58.0552735Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:26:58.0552997Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:26:58.0553503Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:26:58.0554016Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:26:58.0554261Z #define le64toh(x) (x) 2025-05-07T20:26:58.0554483Z #define FILENAME_MAX 4096 2025-05-07T20:26:58.0554777Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:26:58.0555120Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:26:58.0555402Z #define L_cuserid 9 2025-05-07T20:26:58.0555611Z #define __ino_t_defined 2025-05-07T20:26:58.0555845Z #define __k8__ 1 2025-05-07T20:26:58.0556060Z #define __INTPTR_TYPE__ long int 2025-05-07T20:26:58.0556332Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:26:58.0556614Z #define __int8_t_defined 2025-05-07T20:26:58.0556856Z #define __WCHAR_TYPE__ int 2025-05-07T20:26:58.0557219Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0557509Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:26:58.0557800Z #define __SLONGWORD_TYPE long int 2025-05-07T20:26:58.0558054Z #define _IOS_TRUNC 16 2025-05-07T20:26:58.0558306Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:26:58.0558652Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:26:58.0558966Z #define __HAVE_COLUMN 2025-05-07T20:26:58.0559384Z #define __stub_fdetach 2025-05-07T20:26:58.0559986Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:26:58.0560561Z #define __pic__ 2 2025-05-07T20:26:58.0560807Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.0561117Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:26:58.0561381Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:26:58.0561646Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:26:58.0561923Z #define __stub_chflags 2025-05-07T20:26:58.0562178Z #define CLOCK_BOOTTIME 7 2025-05-07T20:26:58.0562417Z #define __need_IOV_MAX 2025-05-07T20:26:58.0562666Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:26:58.0562975Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:26:58.0563266Z #define __cpp_decltype 200707L 2025-05-07T20:26:58.0563534Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:26:58.0563809Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:26:58.0564073Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:26:58.0564364Z #define TTY_NAME_MAX 32 2025-05-07T20:26:58.0564673Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:26:58.0565051Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0565429Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:26:58.0565798Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:26:58.0566095Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:26:58.0566512Z #define STA_PPSTIME 0x0004 2025-05-07T20:26:58.0566762Z #define __import__ 2025-05-07T20:26:58.0566984Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:26:58.0567272Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:26:58.0567578Z #define __export__ 2025-05-07T20:26:58.0567833Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:26:58.0568147Z #define cudaMemAttachHost 0x02 2025-05-07T20:26:58.0568486Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:58.0568835Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:26:58.0569093Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:26:58.0569348Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:26:58.0569610Z #define _WCHAR_T_DECLARED 
2025-05-07T20:26:58.0569886Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:26:58.0570218Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:26:58.0570531Z #define __cpp_inline_variables 201606L 2025-05-07T20:26:58.0570816Z #define WNOWAIT 0x01000000 2025-05-07T20:26:58.0571062Z #define PLOSS 6 2025-05-07T20:26:58.0571285Z #define M_LN10 2.30258509299404568402 2025-05-07T20:26:58.0571728Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:26:58.0572175Z #define EXIT_SUCCESS 0 2025-05-07T20:26:58.0572419Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:26:58.0572686Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:26:58.0572957Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:26:58.0573326Z #define __thread__ __thread 2025-05-07T20:26:58.0573569Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:26:58.0573828Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:26:58.0574091Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:26:58.0574495Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:58.0574922Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:26:58.0575220Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:26:58.0575459Z #define __linux__ 1 2025-05-07T20:26:58.0575679Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:26:58.0575968Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:26:58.0576407Z #define __S16_TYPE short int 2025-05-07T20:26:58.0576905Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:26:58.0577440Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:26:58.0577812Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:26:58.0578182Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:26:58.0578446Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:26:58.0578704Z #define _T_SIZE_ 2025-05-07T20:26:58.0578923Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:26:58.0579214Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:26:58.0579513Z #define _PSTL_VERSION 12000 2025-05-07T20:26:58.0579781Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:26:58.0580086Z #define __WNOTHREAD 0x20000000 2025-05-07T20:26:58.0580343Z #define _G_va_list __gnuc_va_list 2025-05-07T20:26:58.0580643Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:26:58.0580941Z #define _IOS_INPUT 1 2025-05-07T20:26:58.0581167Z #define __USE_LARGEFILE64 1 2025-05-07T20:26:58.0581424Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:26:58.0581708Z #define __INT64_TYPE__ long int 2025-05-07T20:26:58.0581970Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:26:58.0582230Z #define __shared__ __location__(shared) 2025-05-07T20:26:58.0582504Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:26:58.0582809Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:26:58.0583146Z #define __gid_t_defined 2025-05-07T20:26:58.0583402Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:26:58.0583698Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:26:58.0584065Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:26:58.0584456Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:26:58.0584712Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:26:58.0585035Z #define ___int_size_t_h 2025-05-07T20:26:58.0585297Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0585608Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:26:58.0585969Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:58.0586311Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:58.0586582Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:58.0586849Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:58.0587115Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:58.0587395Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0587715Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:58.0588029Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:58.0588323Z #define __clock_t_defined 1 2025-05-07T20:26:58.0588572Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:58.0588860Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:58.0589140Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:58.0589384Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:58.0589646Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:58.0589933Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:58.0590216Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:58.0590541Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:58.0590890Z #define __SSE__ 1 2025-05-07T20:26:58.0591106Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:58.0591373Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:58.0591641Z #define _CTYPE_H 1 2025-05-07T20:26:58.0591854Z #define __sigset_t_defined 2025-05-07T20:26:58.0592109Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:58.0592372Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:58.0592613Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:58.0592849Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:58.0593116Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:58.0593358Z #define __SM_70_RT_H__ 2025-05-07T20:26:58.0593587Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:58.0593858Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:58.0594222Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:58.0594536Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.0594883Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:58.0595150Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:58.0595243Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:58.0595332Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:58.0595413Z #define __amd64__ 1 2025-05-07T20:26:58.0595499Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:58.0595603Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:58.0595865Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.0595964Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:58.0596046Z #define EOF (-1) 2025-05-07T20:26:58.0596141Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:58.0596233Z #define __USE_POSIX199309 1 2025-05-07T20:26:58.0596334Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:58.0596436Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:58.0596530Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:58.0596629Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:58.0596741Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:58.0596831Z #define ____mbstate_t_defined 1 2025-05-07T20:26:58.0596922Z #define STA_NANO 0x2000 2025-05-07T20:26:58.0597015Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:58.0597106Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:58.0597193Z #define _IO_LINKED 0x80 2025-05-07T20:26:58.0597289Z #define __cpp_lib_launder 201606 2025-05-07T20:26:58.0597382Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:26:58.0597488Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:58.0597581Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:58.0597679Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:58.0597821Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:58.0597927Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0598116Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:58.0598216Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:58.0598308Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:58.0598401Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:58.0598530Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:58.0598652Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.0598852Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:58.0599032Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:58.0599121Z #define __stub_stty 2025-05-07T20:26:58.0599286Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:58.0599371Z #define le16toh(x) (x) 2025-05-07T20:26:58.0599479Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:58.0599649Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:58.0599729Z #define _SIZET_ 2025-05-07T20:26:58.0599826Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:58.0599916Z #define _SVID_SOURCE 1 2025-05-07T20:26:58.0599996Z #define _LP64 1 2025-05-07T20:26:58.0600092Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:58.0600320Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:58.0600430Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:58.0600513Z #define __UINT8_C(c) c 2025-05-07T20:26:58.0600606Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:58.0600700Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:58.0600809Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:58.0600901Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:58.0600993Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:58.0601088Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:58.0601171Z #define CUDARTAPI 2025-05-07T20:26:58.0601255Z #define IOV_MAX 1024 2025-05-07T20:26:58.0601398Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:58.0601495Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:58.0601601Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:58.0601816Z #define __wchar_t__ 2025-05-07T20:26:58.0601971Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:58.0602070Z #define SEEK_END 2 2025-05-07T20:26:58.0602164Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:58.0602337Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:58.0602433Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:58.0602575Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:58.0602667Z #define ____FILE_defined 1 2025-05-07T20:26:58.0602779Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:58.0602874Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:58.0602965Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:58.0603057Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:58.0603301Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.0603428Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:58.0603516Z #define _IO_RIGHT 04 2025-05-07T20:26:58.0603663Z #define __END_NAMESPACE_STD 2025-05-07T20:26:58.0603887Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:58.0603982Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:58.0604107Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:58.0604208Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:58.0604312Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:58.0604401Z #define _STDDEF_H_ 2025-05-07T20:26:58.0604572Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.0604675Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.0604794Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:58.0604990Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:58.0605109Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.0605253Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:58.0605507Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:58.0605626Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:58.0605741Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:58.0605838Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:58.0605973Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:58.0606079Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:58.0606203Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:58.0606302Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:58.0606476Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:58.0606574Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:58.0606673Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:58.0606769Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:58.0606918Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:58.0607015Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:58.0607109Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:58.0607219Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:58.0607320Z #define P_tmpdir "/tmp" 2025-05-07T20:26:58.0607441Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:58.0607540Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:58.0607644Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:58.0607812Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:58.0607984Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:58.0608084Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:58.0608206Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:58.0608321Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:58.0608432Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:58.0608662Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:58.0608761Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:58.0608880Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:58.0608980Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:58.0609159Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:58.0609257Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:58.0609357Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:58.0609453Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:58.0609540Z #define __FXSR__ 1 2025-05-07T20:26:58.0609622Z #define _SIZE_T 2025-05-07T20:26:58.0609726Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:58.0609844Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:26:58.0610013Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.0610163Z 
#define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:58.0610264Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:58.0610366Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:58.0610553Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:58.0610755Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:58.0610849Z #define _GXX_NULLPTR_T 2025-05-07T20:26:58.0610982Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:58.0611072Z #define FOPEN_MAX 16 2025-05-07T20:26:58.0611163Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:58.0611288Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:58.0611386Z #define __suseconds_t_defined 2025-05-07T20:26:58.0611473Z #define __off_t_defined 2025-05-07T20:26:58.0611564Z #define stderr stderr 2025-05-07T20:26:58.0611661Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:58.0611773Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:58.0611879Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:58.0611972Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:58.0612375Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:58.0612469Z #define __mode_t_defined 2025-05-07T20:26:58.0612551Z #define _GCC_SIZE_T 2025-05-07T20:26:58.0612737Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.0612848Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:58.0612958Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:58.0613112Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:58.0613211Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:58.0613319Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:58.0613434Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:58.0613549Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:58.0613648Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:58.0613734Z #define __size_t__ 2025-05-07T20:26:58.0613869Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:58.0613971Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:58.0614148Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:58.0614443Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:58.0614590Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:58.0614800Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:58.0619410Z #define _ENDIAN_H 1 2025-05-07T20:26:58.0619544Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:58.0619645Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:58.0619751Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:58.0619831Z #define __try try 2025-05-07T20:26:58.0619933Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:58.0620025Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:58.0620112Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:58.0620375Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:58.0620462Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:58.0620540Z #define __PIC__ 2 2025-05-07T20:26:58.0620656Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:58.0620773Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:58.0620900Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:58.0620997Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:26:58.0621095Z #define _GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:58.0621421Z #define __FLT32X_NORM_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.0621519Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:58.0621618Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:58.0621711Z #define _IO_uid_t __uid_t 2025-05-07T20:26:58.0621807Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:58.0621932Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:58.0622027Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:58.0622174Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:58.0622274Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:58.0622395Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:58.0622475Z #define LONG_BIT 64 2025-05-07T20:26:58.0622582Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:58.0622679Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:58.0622808Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:58.0622913Z #define __fsfilcnt_t_defined 2025-05-07T20:26:58.0623008Z #define __blkcnt_t_defined 2025-05-07T20:26:58.0623344Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:58.0623478Z #define __USE_LARGEFILE 1 2025-05-07T20:26:58.0623607Z #define __cpp_constexpr 201603L 2025-05-07T20:26:58.0623700Z #define CUDART_VERSION 12060 2025-05-07T20:26:58.0623791Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:58.0623888Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:58.0623976Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:58.0624171Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:58.0624260Z #define __lldiv_t_defined 1 2025-05-07T20:26:58.0624343Z #define __SSE2__ 1 2025-05-07T20:26:58.0624422Z #define _IOLBF 1 2025-05-07T20:26:58.0624519Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:58.0624615Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:58.0624718Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:58.0625425Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:58.0625550Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:58.0625641Z #define __INT32_TYPE__ int 2025-05-07T20:26:58.0625753Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:58.0625875Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:58.0625989Z #define __cpp_exceptions 199711L 2025-05-07T20:26:58.0626086Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:58.0626194Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:58.0626286Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:58.0626410Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:58.0626568Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:58.0626664Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:58.0626757Z #define __SWORD_TYPE long int 2025-05-07T20:26:58.0626850Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:58.0626955Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:58.0627046Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:58.0627145Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:58.0627434Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.0627526Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:58.0627676Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:58.0627752Z #define _T_SIZE 2025-05-07T20:26:58.0627857Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:58.0627984Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.0628106Z #define __va_arg_pack() __builtin_va_arg_pack () 
2025-05-07T20:26:58.0628195Z [... preprocessor macro dump elided: several thousand `#define` lines printed by the toolchain check, covering glibc and libstdc++ configuration macros (e.g. _GLIBCXX_HAVE_TLS 1, _GNU_SOURCE 1), C++ feature-test macros (e.g. __cpp_generic_lambdas 201304L), and the CUDA compiler's own identifiers (e.g. __NVCC__ 1, __CUDACC__ 1) ...]
2025-05-07T20:26:58.0646222Z 
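For reference, a macro dump like the one elided above can be regenerated by asking the toolchain to print every macro it defines; a minimal sketch, assuming the same build_binary environment and an illustrative probe.cu file (the exact flags the setup script used are not shown in this excerpt):

    # Create a trivial CUDA translation unit to preprocess (probe.cu is hypothetical)
    echo '// empty probe' > probe.cu
    # Preprocess only (-E) and have the host compiler list all defined macros (-dM)
    conda run -n build_binary nvcc -E -Xcompiler -dM probe.cu | sort
    # The host compiler's baseline view, without nvcc in front:
    conda run -n build_binary g++ -E -dM -x c++ /dev/null | sort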
2025-05-07T20:26:58.0696743Z 2025-05-07T20:26:58.0697122Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:58.0697134Z 2025-05-07T20:26:59.9589067Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:26:59.9589457Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:26:59.9589771Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:26:59.9590090Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:26:59.9590408Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:26:59.9590620Z 2025-05-07T20:27:00.0226538Z 2025-05-07T20:27:00.0236872Z /usr/bin/nvidia-smi 2025-05-07T20:27:00.0242142Z + nvidia-smi 2025-05-07T20:27:00.0242387Z 2025-05-07T20:27:00.0415805Z Wed May 7 20:27:00 2025 2025-05-07T20:27:00.0416179Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.0416687Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:27:00.0417224Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.0417729Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:27:00.0418246Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:27:00.0418667Z | | | MIG M. | 2025-05-07T20:27:00.0418998Z |=========================================+========================+======================| 2025-05-07T20:27:00.0587333Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:27:00.0587779Z | 0% 26C P8 15W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:27:00.0588149Z | | | N/A | 2025-05-07T20:27:00.0588535Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.0592182Z 2025-05-07T20:27:00.0592810Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.0593241Z | Processes: | 2025-05-07T20:27:00.0593678Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:27:00.0594093Z | ID ID Usage | 2025-05-07T20:27:00.0594426Z |=========================================================================================| 2025-05-07T20:27:00.0598521Z | No running processes found | 2025-05-07T20:27:00.0598990Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.3120509Z 2025-05-07T20:27:00.3125709Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:27:00.3178846Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.3179421Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.3191350Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:00.3191694Z env: 2025-05-07T20:27:00.3191925Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:00.3192211Z BUILD_ENV: build_binary 2025-05-07T20:27:00.3192465Z BUILD_TARGET: genai 2025-05-07T20:27:00.3192694Z BUILD_VARIANT: cuda 2025-05-07T20:27:00.3192931Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:27:00.3193183Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:00.3193487Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:00.3193816Z ##[endgroup] 2025-05-07T20:27:00.6540813Z ################################################################################ 2025-05-07T20:27:00.6541192Z # Install PyTorch (PIP) 2025-05-07T20:27:00.6541430Z # 2025-05-07T20:27:00.6557560Z # [2025-05-07T20:27:00.655Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:27:00.6558029Z ################################################################################ 2025-05-07T20:27:00.6558247Z 2025-05-07T20:27:00.6586842Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:01.6505978Z Channels: 2025-05-07T20:27:01.6506229Z - conda-forge 2025-05-07T20:27:01.6506452Z Platform: linux-64 2025-05-07T20:27:05.1313655Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:05.8633797Z Solving environment: \ | / done 2025-05-07T20:27:06.0844530Z 2025-05-07T20:27:06.0844801Z ## Package Plan ## 2025-05-07T20:27:06.0845072Z 2025-05-07T20:27:06.0845449Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:06.0846206Z 2025-05-07T20:27:06.0846459Z added / updated specs: 2025-05-07T20:27:06.0847076Z - numpy 2025-05-07T20:27:06.0847358Z 2025-05-07T20:27:06.0847396Z 2025-05-07T20:27:06.0847690Z The following packages will be downloaded: 2025-05-07T20:27:06.0848212Z 2025-05-07T20:27:06.0848445Z package | build 2025-05-07T20:27:06.0849043Z ---------------------------|----------------- 2025-05-07T20:27:06.0849515Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:06.0849973Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:06.0850426Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:06.0850887Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:06.0851353Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:06.0851832Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:06.0852286Z numpy-2.2.5 | py312h72c5963_0 8.1 MB conda-forge 2025-05-07T20:27:06.0852683Z ------------------------------------------------------------ 2025-05-07T20:27:06.0853350Z Total: 15.4 MB 2025-05-07T20:27:06.0853564Z 2025-05-07T20:27:06.0853695Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:06.0853925Z 2025-05-07T20:27:06.0854139Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:06.0854647Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:06.0855164Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:06.0855670Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:06.0856193Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:06.0856741Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:06.0857471Z numpy 
conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
2025-05-07T20:27:06.0857913Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:27:06.9813875Z [... interleaved conda download progress bars elided: libblas, libcblas, liblapack, libgfortran, libgfortran5, libopenblas, and numpy each downloaded to 100% ...]
2025-05-07T20:27:07.0816244Z Preparing transaction: done
2025-05-07T20:27:07.2824995Z Verifying transaction: done
2025-05-07T20:27:07.3833985Z Executing transaction: done
2025-05-07T20:27:07.5598670Z ################################################################################
2025-05-07T20:27:07.5599412Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:07.5599845Z #
2025-05-07T20:27:07.5617503Z # [2025-05-07T20:27:07.561Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:07.5618192Z ################################################################################
2025-05-07T20:27:07.5634938Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:07.6522114Z [CHECK] Network does not appear to be blocked.
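The [EXEC] [ATTEMPT 0/3] prefix above (and on later steps) comes from a retry wrapper in the prelude script. A minimal sketch of that pattern, with illustrative names rather than the real setup_env.bash helper:

    # Run a command up to max_retries+1 times, pausing briefly between attempts
    exec_with_retries () {
      local max_retries="$1"; shift
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2
      done
      return 1
    }
    # Usage mirroring the network probe above:
    exec_with_retries 3 wget -q --timeout 1 pypi.org -O /dev/null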
2025-05-07T20:27:07.6522486Z ################################################################################ 2025-05-07T20:27:07.6522823Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:07.6523120Z # 2025-05-07T20:27:07.6541003Z # [2025-05-07T20:27:07.653Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:07.6541673Z ################################################################################ 2025-05-07T20:27:07.6564767Z 2025-05-07T20:27:07.6565024Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:07.6590912Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:07.6607075Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:07.6607624Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:07.6614742Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:07.6622895Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:07.6643454Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:30.4022457Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:30.4022948Z Collecting torch 2025-05-07T20:28:30.4023640Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:30.4024362Z Collecting filelock (from torch) 2025-05-07T20:28:30.4024869Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:30.4025910Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2) 2025-05-07T20:28:30.4026966Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1) 2025-05-07T20:28:30.4027627Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:30.4028156Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:30.4028982Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 36.1 MB/s eta 0:00:00 2025-05-07T20:28:30.4029341Z Collecting networkx (from torch) 2025-05-07T20:28:30.4029848Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:30.4030504Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 16.9 MB/s eta 0:00:00 2025-05-07T20:28:30.4030847Z Collecting jinja2 (from torch) 2025-05-07T20:28:30.4031327Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:30.4031834Z Collecting fsspec (from torch) 2025-05-07T20:28:30.4032322Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:30.4032893Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:28:30.4033601Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:28:30.4034382Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 62.0 MB/s eta 0:00:00 2025-05-07T20:28:30.4034804Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:28:30.4035522Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:28:30.4037097Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 11.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4037504Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:28:30.4038395Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:28:30.4039178Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 38.8 MB/s eta 0:00:00 2025-05-07T20:28:30.4039558Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:28:30.4040227Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:28:30.4040977Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 33.3 MB/s eta 0:00:00 2025-05-07T20:28:30.4041612Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:28:30.4042388Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:28:30.4043229Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 59.9 MB/s eta 0:00:00 2025-05-07T20:28:30.4043609Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:28:30.4044272Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:28:30.4045025Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 100.4 MB/s eta 0:00:00 2025-05-07T20:28:30.4045404Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:28:30.4046073Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:28:30.4046848Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 145.0 MB/s eta 0:00:00 2025-05-07T20:28:30.4047251Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:28:30.4047934Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:28:30.4048702Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 132.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4049095Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:28:30.4049778Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:28:30.4050547Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 102.0 MB/s eta 0:00:00 2025-05-07T20:28:30.4050935Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:30.4051676Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:30.4052447Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 112.4 MB/s eta 0:00:00 2025-05-07T20:28:30.4052830Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:30.4053717Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:30.4054472Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:28:30.4055112Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 
2025-05-07T20:28:30.4055771Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:28:30.4056533Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:28:30.4057379Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 192.4 MB/s eta 0:00:00 2025-05-07T20:28:30.4057767Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:28:30.4058540Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:30.4059736Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:30.4060551Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:30.4061365Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:30.4061915Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:30.4062557Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 46.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4062922Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:30.4063775Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:28:30.4064820Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:28:30.4065624Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 37.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4066370Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:28:30.4067202Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 55.5 MB/s eta 0:00:00 2025-05-07T20:28:30.4067941Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:30.4068766Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 125.1 MB/s eta 0:00:00 2025-05-07T20:28:30.4069553Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:30.4070437Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 133.9 MB/s eta 0:00:00 2025-05-07T20:28:30.4072191Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:30.4074105Z 2025-05-07T20:28:30.4076047Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 
nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:28:30.4078198Z 2025-05-07T20:28:32.6246516Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:28:32.6248869Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:36.1134555Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:39.6444422Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:39.6445089Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:43.0632400Z True 2025-05-07T20:28:43.0632694Z True 2025-05-07T20:28:43.0632801Z 2025-05-07T20:28:43.1255121Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:43.1292926Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:43.1293628Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:43.1305619Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:43.1305969Z env: 2025-05-07T20:28:43.1306200Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:43.1306496Z BUILD_ENV: build_binary 2025-05-07T20:28:43.1306744Z BUILD_TARGET: genai 2025-05-07T20:28:43.1306973Z BUILD_VARIANT: cuda 2025-05-07T20:28:43.1307215Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:43.1307463Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:43.1307762Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:43.1308089Z ##[endgroup] 2025-05-07T20:28:43.4653619Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:43.4655475Z ################################################################################ 2025-05-07T20:28:43.4655970Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:43.4656331Z # 2025-05-07T20:28:43.4671647Z # [2025-05-07T20:28:43.466Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:43.4672062Z ################################################################################ 2025-05-07T20:28:43.4672277Z 2025-05-07T20:28:43.4687042Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:43.5579857Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:43.5589977Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:43.5590594Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:43.5590984Z 2025-05-07T20:28:43.6499797Z 2025-05-07T20:28:43.6500382Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:43.6523175Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:49.5930679Z Collecting environment information... 
2025-05-07T20:28:49.5931037Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:49.5931323Z Is debug build: False 2025-05-07T20:28:49.5931586Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:49.5931871Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:49.5932044Z 2025-05-07T20:28:49.5932148Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:49.5932468Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:49.5932784Z Clang version: Could not collect 2025-05-07T20:28:49.5933138Z CMake version: Could not collect 2025-05-07T20:28:49.5933417Z Libc version: glibc-2.34 2025-05-07T20:28:49.5933576Z 2025-05-07T20:28:49.5933879Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:28:49.5934484Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:49.5934888Z Is CUDA available: True 2025-05-07T20:28:49.5935145Z CUDA runtime version: 12.6.85 2025-05-07T20:28:49.5935421Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:49.5935727Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:49.5936327Z Nvidia driver version: 570.133.07 2025-05-07T20:28:49.5936606Z cuDNN version: Could not collect 2025-05-07T20:28:49.5936881Z HIP runtime version: N/A 2025-05-07T20:28:49.5937122Z MIOpen runtime version: N/A 2025-05-07T20:28:49.5937375Z Is XNNPACK available: True 2025-05-07T20:28:49.5937539Z 2025-05-07T20:28:49.5937619Z CPU: 2025-05-07T20:28:49.5937835Z Architecture: x86_64 2025-05-07T20:28:49.5938156Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:49.5938567Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:49.5938970Z Byte Order: Little Endian 2025-05-07T20:28:49.5939275Z CPU(s): 16 2025-05-07T20:28:49.5939566Z On-line CPU(s) list: 0-15 2025-05-07T20:28:49.5940084Z Vendor ID: AuthenticAMD 2025-05-07T20:28:49.5940418Z Model name: AMD EPYC 7R32 2025-05-07T20:28:49.5940732Z CPU family: 23 2025-05-07T20:28:49.5941024Z Model: 49 2025-05-07T20:28:49.5941300Z Thread(s) per core: 2 2025-05-07T20:28:49.5941585Z Core(s) per socket: 8 2025-05-07T20:28:49.5941864Z Socket(s): 1 2025-05-07T20:28:49.5942136Z Stepping: 0 2025-05-07T20:28:49.5942421Z BogoMIPS: 5599.99 2025-05-07T20:28:49.5944443Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:49.5946451Z Hypervisor vendor: KVM 2025-05-07T20:28:49.5946757Z Virtualization type: full 2025-05-07T20:28:49.5947089Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:49.5947442Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:49.5947801Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:49.5948153Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:49.5948470Z NUMA node(s): 1 2025-05-07T20:28:49.5948752Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:49.5949081Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:49.5949447Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:49.5949798Z Vulnerability L1tf: Not affected 2025-05-07T20:28:49.5950143Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:49.5950494Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:49.5950837Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:49.5951199Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:49.5951731Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:49.5952299Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:49.5952823Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:49.5953497Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:49.5954335Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:49.5954990Z Vulnerability Srbds: Not affected 2025-05-07T20:28:49.5955429Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:49.5955660Z 2025-05-07T20:28:49.5955762Z Versions of relevant libraries: 2025-05-07T20:28:49.5956024Z [pip3] numpy==2.2.5 2025-05-07T20:28:49.5956259Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:49.5956557Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:49.5956865Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:49.5957168Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:49.5957480Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:49.5957762Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:49.5958043Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:49.5958342Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:49.5958693Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:49.5959108Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:49.5960097Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:49.5960409Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:49.5960731Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:49.5961034Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:49.5961328Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:49.5961699Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5962177Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5962738Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:49.5963490Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5964070Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:49.5964703Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:49.5965327Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5965909Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:49.5966444Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:49.5967076Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:49.5967659Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5968189Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:49.5968803Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5969393Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5979032Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5979520Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:49.5979978Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:28:49.5980437Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:49.5980890Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5981330Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:49.5981780Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5982236Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:49.5982691Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:49.5983162Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:49.5983643Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5984112Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:49.5984576Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:49.5985241Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:49.5985693Z [conda] numpy 2.2.5 py312h72c5963_0 conda-forge 2025-05-07T20:28:49.5986139Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:49.5986626Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:49.5987111Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:49.5987600Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:49.5988070Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:49.5988655Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:49.5989121Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:49.5989602Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:49.5990074Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:49.5990556Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:49.5991030Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:49.5991493Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:49.5991958Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:49.5992423Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:49.5992875Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:49.5993141Z 2025-05-07T20:28:49.6695599Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.6696265Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:49.6708241Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:49.6708599Z env: 2025-05-07T20:28:49.6708837Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:49.6709136Z BUILD_ENV: build_binary 2025-05-07T20:28:49.6709456Z BUILD_TARGET: genai 2025-05-07T20:28:49.6709724Z BUILD_VARIANT: cuda 2025-05-07T20:28:49.6709957Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:49.6710229Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:49.6710532Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:49.6710862Z ##[endgroup] 2025-05-07T20:28:50.0099869Z ################################################################################ 2025-05-07T20:28:50.0100265Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:50.0100535Z # 2025-05-07T20:28:50.0115545Z # [2025-05-07T20:28:50.011Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:50.0115947Z ################################################################################ 2025-05-07T20:28:50.0116185Z 2025-05-07T20:28:50.0130947Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:50.1001901Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:50.1023193Z [BUILD] Running git submodules update ... 2025-05-07T20:28:50.1044687Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:50.1410036Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:50.1410492Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:50.1410940Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:50.1411333Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:50.1411728Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:50.1412177Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:50.1412578Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:50.1445853Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:50.1999505Z [BUILD] Installing other build dependencies ... 
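Condensed, the prepare step above reduces to three commands (each wrapped in the retry helper sketched earlier); the equivalent manual sequence, assuming a checkout with submodules already configured:

    # Refresh submodule URLs and fetch the pinned commits (asmjit, cutlass, cpuinfo, ...)
    git submodule sync
    git submodule update --init --recursive
    # Install the FBGEMM-GPU build dependencies into the conda environment
    conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt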
2025-05-07T20:28:50.2021996Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:52.6036838Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:52.6220076Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:52.7348254Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:52.7379319Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:52.9610026Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:52.9651609Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:53.0824200Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:53.0869916Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:53.4226693Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:53.4289568Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:53.4838071Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:53.4842920Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:53.5632619Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:53.5663920Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:53.6176685Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:53.6671677Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:53.6722710Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:53.8045790Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:53.8076852Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:53.9218818Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:53.9268298Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:53.9926338Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:54.0584104Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:54.0627844Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:54.1770280Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:54.1799648Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:54.2985285Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:54.3033636Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:54.4296523Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.4325651Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:54.5364572Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:28:54.5396165Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:54.6478347Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.6512470Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.7506685Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.7549399Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:54.7994984Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:54.8456971Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:54.8486364Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:54.8938516Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:54.9461991Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:54.9490801Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:55.0003499Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:55.0699000Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:55.0729903Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:55.1287298Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:55.1873138Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:55.2451619Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:55.7922777Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 51.0 MB/s eta 0:00:00 2025-05-07T20:28:55.7956627Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:55.8521564Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:55.9151074Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:55.9694718Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:56.0314069Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:56.0882402Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:28:56.1532864Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 8.0 MB/s eta 0:00:00 2025-05-07T20:28:56.1582172Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:56.2199571Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.2814714Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:56.3402602Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:56.4071146Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:56.4615257Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:56.5188663Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:56.5822125Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:56.6343077Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:56.6946226Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:56.8635902Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:59.2492447Z 2025-05-07T20:28:59.2537842Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:59.4220686Z ################################################################################ 2025-05-07T20:28:59.4221089Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:59.4221359Z # 2025-05-07T20:28:59.4239226Z # [2025-05-07T20:28:59.423Z] + install_triton_pip build_binary 2025-05-07T20:28:59.4239617Z ################################################################################ 2025-05-07T20:28:59.4239837Z 2025-05-07T20:28:59.4240058Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:59.4240491Z ################################################################################ 2025-05-07T20:28:59.4240838Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:59.4241157Z # 2025-05-07T20:28:59.4257408Z # [2025-05-07T20:28:59.425Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.4258043Z ################################################################################ 2025-05-07T20:28:59.4258267Z 2025-05-07T20:28:59.4273675Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:59.5170868Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:59.5171552Z ################################################################################ 2025-05-07T20:28:59.5172196Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:59.5172513Z # 2025-05-07T20:28:59.5188655Z # [2025-05-07T20:28:59.518Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:59.5189132Z ################################################################################ 2025-05-07T20:28:59.5189341Z 2025-05-07T20:28:59.5236362Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:59.5254032Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:59.5254552Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:59.5261791Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:59.5271311Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:59.5292887Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:07.4012483Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:07.4013900Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:07.4014604Z 2025-05-07T20:29:07.4014818Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:07.4015218Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:07.4016014Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:07.4017201Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:07.4018260Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 52.6 MB/s eta 0:00:00 2025-05-07T20:29:07.4018634Z Installing collected packages: pytorch-triton 2025-05-07T20:29:07.4018968Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:07.4019351Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:07.4019765Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:07.4020189Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:07.4020624Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:07.4021200Z 2025-05-07T20:29:09.6245839Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:09.6249830Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:11.7698733Z ################################################################################ 2025-05-07T20:29:11.7699195Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:11.7699590Z ################################################################################ 2025-05-07T20:29:11.7699816Z 2025-05-07T20:29:13.8123993Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:15.9878553Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:15.9882080Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:15.9915097Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:15.9915592Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:15.9927205Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:15.9927567Z env: 2025-05-07T20:29:15.9927799Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:15.9928091Z BUILD_ENV: build_binary 2025-05-07T20:29:15.9928343Z BUILD_TARGET: genai 2025-05-07T20:29:15.9928577Z BUILD_VARIANT: cuda 2025-05-07T20:29:15.9928816Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:15.9929070Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:15.9929368Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:15.9929770Z ##[endgroup] 2025-05-07T20:29:16.3295744Z ################################################################################ 2025-05-07T20:29:16.3296136Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:16.3296401Z # 2025-05-07T20:29:16.3312502Z # [2025-05-07T20:29:16.330Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3313146Z ################################################################################ 2025-05-07T20:29:16.3313364Z 2025-05-07T20:29:16.3313735Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3314422Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3314762Z 2025-05-07T20:29:16.3432667Z 839b6c4a76b132decd86ba2192408e2709e83cea fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3435357Z 2025-05-07T20:29:16.3435748Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3569602Z 2025-05-07T20:29:16.3570402Z 1b0d0e6113168fc8d58f5641aa11b1400e22aeae573cc3e05b442ee4be9a1e2d fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3573052Z 2025-05-07T20:29:16.3582387Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3582897Z 2025-05-07T20:29:16.3805145Z 54d55da1a6aeedb5d1904417fe635ccb fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:16.3807340Z 2025-05-07T20:29:16.3817312Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:16.3839611Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:19.0527472Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:19.0528396Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:19.0529230Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:19.0529669Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:19.0529932Z 2025-05-07T20:29:26.0228485Z ################################################################################ 2025-05-07T20:29:26.0229284Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:26.0229663Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:26.0230080Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:29:26.0230398Z [CHECK] 2025-05-07T20:29:26.0230720Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:26.0231208Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:26.0231603Z ################################################################################ 2025-05-07T20:29:26.0231820Z 2025-05-07T20:29:26.0231945Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:30.0443915Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:34.0495168Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:38.0567072Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:38.0570306Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:50.0720150Z ################################################################################ 2025-05-07T20:29:50.0720698Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:50.0721165Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:50.0721517Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:50.0721858Z ################################################################################ 2025-05-07T20:29:50.0722073Z 2025-05-07T20:29:58.0875223Z ################################################################################ 2025-05-07T20:29:58.0875669Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:58.0877046Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:58.0878590Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:58.0879106Z ################################################################################ 2025-05-07T20:29:58.0879331Z 2025-05-07T20:29:58.0879488Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:02.0989163Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:06.1091444Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:10.2298220Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:14.2328479Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:14.2332298Z [INSTALL] Check for operator registrations ... 
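[NOTE] The registration probe below resolves each operator on the torch.ops.fbgemm namespace; a minimal sketch of such a check (an illustration, not the harness's actual code) is:

    import torch
    import fbgemm_gpu  # noqa: F401  # importing loads the shared libraries that register the ops

    def op_registered(name: str) -> bool:
        # torch.ops.fbgemm.<name> raises AttributeError when no loaded
        # library has registered a schema under that name.
        try:
            getattr(torch.ops.fbgemm, name)
            return True
        except AttributeError:
            return False

    for op in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        print(op, "registered:", op_registered(op))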
2025-05-07T20:30:18.1485730Z fbgemm.nccl_init 2025-05-07T20:30:18.1487788Z 2025-05-07T20:30:18.2113438Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:22.1281832Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:22.1282221Z 2025-05-07T20:30:22.1900007Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:26.1196413Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.1196635Z 2025-05-07T20:30:26.1815493Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:26.1816135Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:26.1851499Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.1851962Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:26.1866572Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:26.1866934Z env: 2025-05-07T20:30:26.1867176Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:26.1867476Z BUILD_ENV: build_binary 2025-05-07T20:30:26.1867927Z BUILD_TARGET: genai 2025-05-07T20:30:26.1868164Z BUILD_VARIANT: cuda 2025-05-07T20:30:26.1868401Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:30:26.1868669Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:26.1868979Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:26.1869313Z ##[endgroup] 2025-05-07T20:30:26.5231475Z ################################################################################ 2025-05-07T20:30:26.5231872Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:26.5232132Z # 2025-05-07T20:30:26.5246732Z # [2025-05-07T20:30:26.524Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:26.5247138Z ################################################################################ 2025-05-07T20:30:26.5247359Z 2025-05-07T20:30:34.5001428Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:34.5002003Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:34.5002397Z [TEST] Determined the test directories: 2025-05-07T20:30:34.5002733Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:34.5003033Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:34.5003335Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:34.5003519Z 2025-05-07T20:30:34.5012372Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:34.5019215Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:34.5019653Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:34.5019928Z 2025-05-07T20:30:34.9245025Z 2025-05-07T20:30:34.9245219Z [TEST] Installing PyTest ... 
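[NOTE] The [EXEC] [ATTEMPT 0/3] prefix below comes from a retry wrapper around network-dependent commands; a rough Python equivalent (an assumed shape, not the harness's actual implementation) is:

    import subprocess
    import time

    def exec_with_retries(cmd: list[str], max_attempts: int = 3) -> None:
        # Re-run flaky commands (e.g. package downloads) with exponential backoff.
        for attempt in range(max_attempts):
            print(f"[EXEC] [ATTEMPT {attempt}/{max_attempts}] + {' '.join(cmd)}")
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(2 ** attempt)
        raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")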
2025-05-07T20:30:34.9269753Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:36.0295302Z Channels: 2025-05-07T20:30:36.0295754Z - conda-forge 2025-05-07T20:30:36.0296202Z Platform: linux-64 2025-05-07T20:30:39.3250143Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:40.4662025Z Solving environment: \ | / done 2025-05-07T20:30:40.6909468Z 2025-05-07T20:30:40.6910276Z ## Package Plan ## 2025-05-07T20:30:40.6910517Z 2025-05-07T20:30:40.6912930Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:40.6913426Z 2025-05-07T20:30:40.6913585Z added / updated specs: 2025-05-07T20:30:40.6914117Z - expecttest 2025-05-07T20:30:40.6914477Z - pytest 2025-05-07T20:30:40.6914632Z 2025-05-07T20:30:40.6914636Z 2025-05-07T20:30:40.6914822Z The following packages will be downloaded: 2025-05-07T20:30:40.6915079Z 2025-05-07T20:30:40.6915284Z package | build 2025-05-07T20:30:40.6915738Z ---------------------------|----------------- 2025-05-07T20:30:40.6916210Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:40.6916766Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:40.6917436Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:40.6917977Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:40.6918573Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:40.6919049Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:40.6919549Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:40.6920465Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:40.6921085Z ------------------------------------------------------------ 2025-05-07T20:30:40.6921515Z Total: 428 KB 2025-05-07T20:30:40.6921868Z 2025-05-07T20:30:40.6922051Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:40.6922299Z 2025-05-07T20:30:40.6922565Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:40.6923348Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:40.6924115Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:40.6924772Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:40.6925333Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:40.6925947Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:40.6926440Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:40.6926961Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:40.6927300Z 2025-05-07T20:30:40.6927355Z 2025-05-07T20:30:40.6927360Z 2025-05-07T20:30:40.6927539Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:40.6928028Z [conda download progress bars elided: pytest, packaging, colorama, pluggy, exceptiongroup, tomli, expecttest, and iniconfig all reached 100%] done 2025-05-07T20:30:41.2849042Z Preparing transaction: done 2025-05-07T20:30:41.3853434Z Verifying transaction: done 2025-05-07T20:30:43.2885456Z Executing transaction: done 2025-05-07T20:30:43.4144684Z [TEST] Checking imports ... 2025-05-07T20:30:47.3962383Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:47.3974632Z [TEST] Setting feature flags ...
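[NOTE] The feature flag is pinned on the conda environment (next command) so that every later `conda run` in the job sees it; inside the tests the gate is presumably just an environment-variable read, e.g.:

    import os

    # Assumed gating pattern for the flag set below; illustrative only.
    if os.environ.get("FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD", "0") == "1":
        print("ensemble rowwise Adagrad code paths enabled for this run")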
2025-05-07T20:30:47.3975300Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:47.3975818Z 2025-05-07T20:30:47.8233474Z 2025-05-07T20:30:47.8234290Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:47.8235082Z ################################################################################ 2025-05-07T20:30:47.8235639Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:47.8236053Z # 2025-05-07T20:30:47.8254456Z # [2025-05-07T20:30:47.825Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:47.8255071Z ################################################################################ 2025-05-07T20:30:47.8255316Z 2025-05-07T20:30:47.8262974Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:47.8292177Z ./attention/gqa_test.py 2025-05-07T20:30:47.8292706Z ./coalesce/coalesce_test.py 2025-05-07T20:30:47.8293568Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:47.8294047Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:47.8294636Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:47.8295111Z ./moe/activation_test.py 2025-05-07T20:30:47.8295497Z ./moe/gather_scatter_test.py 2025-05-07T20:30:47.8296029Z ./moe/layers_test.py 2025-05-07T20:30:47.8296472Z ./moe/shuffling_test.py 2025-05-07T20:30:47.8296979Z ./quantize/quantize_test.py 2025-05-07T20:30:47.8297247Z 2025-05-07T20:30:47.8297432Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:47.8297810Z 2025-05-07T20:30:47.8312925Z ################################################################################ 2025-05-07T20:30:47.8327673Z # [2025-05-07T20:30:47.832Z] Run Python Test Suite: 2025-05-07T20:30:47.8328123Z # ./attention/gqa_test.py 2025-05-07T20:30:47.8328498Z ################################################################################ 2025-05-07T20:30:47.8352441Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:47.8353120Z 2025-05-07T20:30:50.3666744Z ============================= test session starts ============================== 2025-05-07T20:30:50.3667880Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:50.3668490Z cachedir: .pytest_cache 2025-05-07T20:30:50.3669417Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:50.3670297Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:50.3670810Z plugins: hypothesis-6.131.14 2025-05-07T20:30:52.0469357Z collecting ... 
collected 2 items 2025-05-07T20:30:52.0469721Z 2025-05-07T20:31:30.4408592Z attention/gqa_test.py::Int4GQATest::test_gqa [Hypothesis example dumps elided: the derandomized 'ci' profile tried several dozen combinations of int4_kv, num_groups (1 or 4), and B, MAX_T, N_H_L values up to ~126] PASSED 2025-05-07T20:31:30.4609015Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
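[NOTE] The session header above reports hypothesis profile 'ci' (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). A profile with those settings would be registered like this (a sketch based only on the reported values):

    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,       # no example database carried between CI runs
        deadline=None,       # first-run kernel compilation can be slow
        print_blob=True,     # print a reproduction blob on failure
        derandomize=True,    # deterministic example order across runs
        suppress_health_check=(HealthCheck.too_slow,),
    )
    settings.load_profile("ci")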
2025-05-07T20:31:30.4609342Z 2025-05-07T20:31:30.4609494Z =========================== short test summary info ============================ 2025-05-07T20:31:30.4610198Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:30.4611055Z ======================== 1 passed, 1 skipped in 40.60s ========================= 2025-05-07T20:31:31.1118336Z 2025-05-07T20:31:31.1118919Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:31.1138933Z [TEST] Python test time for ./attention/gqa_test.py: 44 seconds 2025-05-07T20:31:31.1139224Z 2025-05-07T20:31:31.1139228Z 2025-05-07T20:31:31.1139232Z 2025-05-07T20:31:31.1139236Z 2025-05-07T20:31:31.1159850Z ################################################################################ 2025-05-07T20:31:31.1175239Z # [2025-05-07T20:31:31.117Z] Run Python Test Suite: 2025-05-07T20:31:31.1175632Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:31.1176011Z ################################################################################ 2025-05-07T20:31:31.1201116Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:31.1201735Z 2025-05-07T20:31:33.2781041Z ============================= test session starts ============================== 2025-05-07T20:31:33.2781678Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:33.2782205Z cachedir: .pytest_cache 2025-05-07T20:31:33.2782780Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:33.2783494Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:33.2783911Z plugins: hypothesis-6.131.14 2025-05-07T20:31:35.0191018Z collecting ... 
collected 1 item 2025-05-07T20:31:35.0191426Z 2025-05-07T20:31:35.7737963Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:35.7738287Z 2025-05-07T20:31:35.7738503Z ============================== 1 passed in 2.62s =============================== 2025-05-07T20:31:36.4056382Z 2025-05-07T20:31:36.4057024Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:36.4076093Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:36.4076504Z 2025-05-07T20:31:36.4076509Z 2025-05-07T20:31:36.4076513Z 2025-05-07T20:31:36.4076517Z 2025-05-07T20:31:36.4098684Z ################################################################################ 2025-05-07T20:31:36.4113840Z # [2025-05-07T20:31:36.411Z] Run Python Test Suite: 2025-05-07T20:31:36.4114304Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.4114625Z ################################################################################ 2025-05-07T20:31:36.4138561Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:36.4139179Z 2025-05-07T20:31:38.5711203Z ============================= test session starts ============================== 2025-05-07T20:31:38.5711997Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:38.5712550Z cachedir: .pytest_cache 2025-05-07T20:31:38.5713127Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:38.5713843Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:38.5714250Z plugins: hypothesis-6.131.14 2025-05-07T20:31:40.2748666Z collecting ... 
collected 5 items 2025-05-07T20:31:40.2749264Z 2025-05-07T20:31:40.2761656Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:40.2770979Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:40.2779235Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:40.2787312Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:40.2806542Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:40.2807018Z 2025-05-07T20:31:40.2807516Z =========================== short test summary info ============================ 2025-05-07T20:31:40.2808276Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2809194Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2810269Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2811181Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2812087Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:40.2812733Z ============================== 5 skipped in 1.83s ============================== 2025-05-07T20:31:40.8505891Z 2025-05-07T20:31:40.8506646Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:40.8526112Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:40.8526522Z 2025-05-07T20:31:40.8526529Z 2025-05-07T20:31:40.8526534Z 2025-05-07T20:31:40.8526539Z 2025-05-07T20:31:40.8548805Z ################################################################################ 2025-05-07T20:31:40.8564786Z # [2025-05-07T20:31:40.856Z] Run Python Test Suite: 2025-05-07T20:31:40.8565281Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:40.8565597Z ################################################################################ 2025-05-07T20:31:40.8589644Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:40.8590313Z 2025-05-07T20:31:43.0100281Z ============================= test session starts ============================== 2025-05-07T20:31:43.0101084Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:43.0101604Z cachedir: .pytest_cache 2025-05-07T20:31:43.0102169Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:43.0102942Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:43.0103352Z plugins: hypothesis-6.131.14 2025-05-07T20:31:44.8020277Z collecting ... 
collected 2 items 2025-05-07T20:31:44.8020859Z 2025-05-07T20:31:44.8031178Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:44.8047449Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:44.8048039Z 2025-05-07T20:31:44.8048281Z =========================== short test summary info ============================ 2025-05-07T20:31:44.8048921Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:44.8049739Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:44.8058011Z ============================== 2 skipped in 1.92s ============================== 2025-05-07T20:31:45.3874004Z 2025-05-07T20:31:45.3874804Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:45.3895297Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:45.3895623Z 2025-05-07T20:31:45.3895627Z 2025-05-07T20:31:45.3895642Z 2025-05-07T20:31:45.3895646Z 2025-05-07T20:31:45.3916222Z ################################################################################ 2025-05-07T20:31:45.3931922Z # [2025-05-07T20:31:45.392Z] Run Python Test Suite: 2025-05-07T20:31:45.3932393Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.3932758Z ################################################################################ 2025-05-07T20:31:45.3957943Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:45.3958841Z 2025-05-07T20:31:47.5469987Z ============================= test session starts ============================== 2025-05-07T20:31:47.5470754Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:47.5471281Z cachedir: .pytest_cache 2025-05-07T20:31:47.5471856Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:47.5472562Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:47.5473002Z plugins: hypothesis-6.131.14 2025-05-07T20:31:49.2392758Z collecting ... collected 4 items 2025-05-07T20:31:49.2392962Z 2025-05-07T20:31:52.0040422Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:31:52.0125323Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:52.0223335Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:52.0314217Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:52.0314706Z 2025-05-07T20:31:52.0314919Z =========================== short test summary info ============================ 2025-05-07T20:31:52.0315856Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:52.0316938Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:31:52.0317541Z ============================== 4 skipped in 4.61s ============================== 2025-05-07T20:31:53.8937269Z 2025-05-07T20:31:53.8937866Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:53.8957358Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:53.8957742Z 2025-05-07T20:31:53.8957814Z 2025-05-07T20:31:53.8957901Z 2025-05-07T20:31:53.8957907Z 2025-05-07T20:31:53.8978511Z ################################################################################ 2025-05-07T20:31:53.8993563Z # [2025-05-07T20:31:53.899Z] Run Python Test Suite: 2025-05-07T20:31:53.8994021Z # ./moe/activation_test.py 2025-05-07T20:31:53.8994393Z ################################################################################ 2025-05-07T20:31:53.9020174Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:53.9020841Z 2025-05-07T20:31:56.0571068Z ============================= test session starts ============================== 2025-05-07T20:31:56.0571707Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:56.0572225Z cachedir: .pytest_cache 2025-05-07T20:31:56.0572803Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:56.0573622Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:56.0574034Z plugins: hypothesis-6.131.14 2025-05-07T20:31:57.7163007Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:57.8271955Z collecting ... 
collected 2 items 2025-05-07T20:31:57.8272153Z 2025-05-07T20:32:03.1626261Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:03.1627455Z self=, 2025-05-07T20:32:03.1627851Z T=1, 2025-05-07T20:32:03.1628050Z D=5120, 2025-05-07T20:32:03.1628248Z contiguous=True, 2025-05-07T20:32:03.1628486Z compiled=True, 2025-05-07T20:32:03.1628706Z ) 2025-05-07T20:32:03.1628910Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1629294Z self=, 2025-05-07T20:32:03.1629862Z T=4096, 2025-05-07T20:32:03.1630053Z D=5120, 2025-05-07T20:32:03.1630259Z contiguous=True, 2025-05-07T20:32:03.1630498Z compiled=True, 2025-05-07T20:32:03.1630706Z ) 2025-05-07T20:32:03.1630914Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1631293Z self=, 2025-05-07T20:32:03.1631667Z T=4096, 2025-05-07T20:32:03.1631859Z D=7168, 2025-05-07T20:32:03.1632067Z contiguous=False, 2025-05-07T20:32:03.1632293Z compiled=False, 2025-05-07T20:32:03.1632512Z ) 2025-05-07T20:32:03.1632725Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1633092Z self=, 2025-05-07T20:32:03.1633470Z T=4096, 2025-05-07T20:32:03.1633668Z D=5120, 2025-05-07T20:32:03.1633871Z contiguous=False, 2025-05-07T20:32:03.1634097Z compiled=True, 2025-05-07T20:32:03.1634308Z ) 2025-05-07T20:32:03.1634512Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1634886Z self=, 2025-05-07T20:32:03.1635268Z T=1, 2025-05-07T20:32:03.1635461Z D=7168, 2025-05-07T20:32:03.1635660Z contiguous=True, 2025-05-07T20:32:03.1635895Z compiled=True, 2025-05-07T20:32:03.1636109Z ) 2025-05-07T20:32:03.1636307Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1636688Z self=, 2025-05-07T20:32:03.1637064Z T=1, 2025-05-07T20:32:03.1637253Z D=7168, 2025-05-07T20:32:03.1637465Z contiguous=False, 2025-05-07T20:32:03.1637704Z compiled=True, 2025-05-07T20:32:03.1637912Z ) 2025-05-07T20:32:03.1638124Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1638507Z self=, 2025-05-07T20:32:03.1638876Z T=4096, 2025-05-07T20:32:03.1639078Z D=5120, 2025-05-07T20:32:03.1639282Z contiguous=False, 2025-05-07T20:32:03.1639510Z compiled=False, 2025-05-07T20:32:03.1639733Z ) 2025-05-07T20:32:03.1639940Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1640319Z self=, 2025-05-07T20:32:03.1640691Z T=1, 2025-05-07T20:32:03.1640878Z D=7168, 2025-05-07T20:32:03.1641078Z contiguous=True, 2025-05-07T20:32:03.1641303Z compiled=False, 2025-05-07T20:32:03.1641518Z ) 2025-05-07T20:32:03.1641721Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1642092Z self=, 2025-05-07T20:32:03.1642475Z T=2048, 2025-05-07T20:32:03.1642678Z D=5120, 2025-05-07T20:32:03.1642878Z contiguous=True, 2025-05-07T20:32:03.1643115Z compiled=True, 2025-05-07T20:32:03.1643331Z ) 2025-05-07T20:32:03.1643526Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1643898Z self=, 2025-05-07T20:32:03.1644285Z T=2048, 2025-05-07T20:32:03.1644476Z D=7168, 2025-05-07T20:32:03.1644682Z contiguous=True, 2025-05-07T20:32:03.1644908Z compiled=True, 2025-05-07T20:32:03.1645115Z ) 2025-05-07T20:32:03.1645318Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1645694Z self=, 2025-05-07T20:32:03.1646077Z T=2048, 2025-05-07T20:32:03.1646266Z D=7168, 2025-05-07T20:32:03.1646472Z contiguous=True, 2025-05-07T20:32:03.1646714Z compiled=False, 2025-05-07T20:32:03.1646922Z ) 2025-05-07T20:32:03.1647127Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1647610Z self=, 2025-05-07T20:32:03.1647985Z T=128, 2025-05-07T20:32:03.1648180Z D=5120, 2025-05-07T20:32:03.1648393Z contiguous=False, 2025-05-07T20:32:03.1648620Z 
compiled=True, 2025-05-07T20:32:03.1648831Z ) 2025-05-07T20:32:03.1649040Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1649408Z self=, 2025-05-07T20:32:03.1649869Z T=128, 2025-05-07T20:32:03.1650078Z D=5120, 2025-05-07T20:32:03.1650274Z contiguous=True, 2025-05-07T20:32:03.1650509Z compiled=True, 2025-05-07T20:32:03.1650725Z ) 2025-05-07T20:32:03.1650924Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1651302Z self=, 2025-05-07T20:32:03.1651687Z T=16384, 2025-05-07T20:32:03.1651905Z D=5120, 2025-05-07T20:32:03.1652105Z contiguous=False, 2025-05-07T20:32:03.1652344Z compiled=True, 2025-05-07T20:32:03.1652563Z ) 2025-05-07T20:32:03.1652759Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1653268Z self=, 2025-05-07T20:32:03.1653656Z T=16384, 2025-05-07T20:32:03.1653855Z D=5120, 2025-05-07T20:32:03.1654057Z contiguous=False, 2025-05-07T20:32:03.1654286Z compiled=False, 2025-05-07T20:32:03.1654491Z ) 2025-05-07T20:32:03.1654700Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1655085Z self=, 2025-05-07T20:32:03.1655469Z T=128, 2025-05-07T20:32:03.1655667Z D=7168, 2025-05-07T20:32:03.1655864Z contiguous=True, 2025-05-07T20:32:03.1656094Z compiled=False, 2025-05-07T20:32:03.1656296Z ) 2025-05-07T20:32:03.1656493Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1656866Z self=, 2025-05-07T20:32:03.1657236Z T=128, 2025-05-07T20:32:03.1657438Z D=7168, 2025-05-07T20:32:03.1657651Z contiguous=False, 2025-05-07T20:32:03.1657870Z compiled=False, 2025-05-07T20:32:03.1658081Z ) 2025-05-07T20:32:03.1658282Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1658653Z self=, 2025-05-07T20:32:03.1659032Z T=1, 2025-05-07T20:32:03.1659540Z D=5120, 2025-05-07T20:32:03.1659746Z contiguous=False, 2025-05-07T20:32:03.1659978Z compiled=False, 2025-05-07T20:32:03.1660218Z ) 2025-05-07T20:32:03.1660417Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1660794Z self=, 2025-05-07T20:32:03.1661170Z T=1, 2025-05-07T20:32:03.1661375Z D=7168, 2025-05-07T20:32:03.1661646Z contiguous=False, 2025-05-07T20:32:03.1661930Z compiled=False, 2025-05-07T20:32:03.1662139Z ) 2025-05-07T20:32:03.1662352Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1662738Z self=, 2025-05-07T20:32:03.1663124Z T=4096, 2025-05-07T20:32:03.1663354Z D=5120, 2025-05-07T20:32:03.1663568Z contiguous=True, 2025-05-07T20:32:03.1663797Z compiled=False, 2025-05-07T20:32:03.1664007Z ) 2025-05-07T20:32:03.1664215Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1664584Z self=, 2025-05-07T20:32:03.1664963Z T=128, 2025-05-07T20:32:03.1665165Z D=7168, 2025-05-07T20:32:03.1665369Z contiguous=True, 2025-05-07T20:32:03.1665587Z compiled=True, 2025-05-07T20:32:03.1665798Z ) 2025-05-07T20:32:03.1666001Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1666368Z self=, 2025-05-07T20:32:03.1666746Z T=1, 2025-05-07T20:32:03.1666935Z D=5120, 2025-05-07T20:32:03.1667130Z contiguous=False, 2025-05-07T20:32:03.1667358Z compiled=True, 2025-05-07T20:32:03.1667567Z ) 2025-05-07T20:32:03.1667763Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1668312Z self=, 2025-05-07T20:32:03.1668692Z T=4096, 2025-05-07T20:32:03.1668880Z D=7168, 2025-05-07T20:32:03.1669078Z contiguous=True, 2025-05-07T20:32:03.1669304Z compiled=False, 2025-05-07T20:32:03.1669505Z ) 2025-05-07T20:32:03.1669710Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1670086Z self=, 2025-05-07T20:32:03.1670572Z T=4096, 2025-05-07T20:32:03.1670766Z D=7168, 2025-05-07T20:32:03.1670964Z contiguous=False, 2025-05-07T20:32:03.1671193Z compiled=True, 2025-05-07T20:32:03.1671398Z ) 
2025-05-07T20:32:03.1671603Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1671976Z self=, 2025-05-07T20:32:03.1672345Z T=128, 2025-05-07T20:32:03.1672540Z D=5120, 2025-05-07T20:32:03.1672741Z contiguous=True, 2025-05-07T20:32:03.1672961Z compiled=False, 2025-05-07T20:32:03.1673180Z ) 2025-05-07T20:32:03.1673385Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1673755Z self=, 2025-05-07T20:32:03.1674134Z T=128, 2025-05-07T20:32:03.1674327Z D=5120, 2025-05-07T20:32:03.1674523Z contiguous=False, 2025-05-07T20:32:03.1674756Z compiled=False, 2025-05-07T20:32:03.1674970Z ) 2025-05-07T20:32:03.1675168Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1675553Z self=, 2025-05-07T20:32:03.1675937Z T=1, 2025-05-07T20:32:03.1676119Z D=5120, 2025-05-07T20:32:03.1676324Z contiguous=True, 2025-05-07T20:32:03.1676556Z compiled=False, 2025-05-07T20:32:03.1676759Z ) 2025-05-07T20:32:03.1676960Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1677334Z self=, 2025-05-07T20:32:03.1677710Z T=2048, 2025-05-07T20:32:03.1677893Z D=7168, 2025-05-07T20:32:03.1678094Z contiguous=False, 2025-05-07T20:32:03.1678319Z compiled=True, 2025-05-07T20:32:03.1678523Z ) 2025-05-07T20:32:03.1678729Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1679098Z self=, 2025-05-07T20:32:03.1679467Z T=2048, 2025-05-07T20:32:03.1679659Z D=7168, 2025-05-07T20:32:03.1679858Z contiguous=False, 2025-05-07T20:32:03.1680082Z compiled=False, 2025-05-07T20:32:03.1680298Z ) 2025-05-07T20:32:03.1680501Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1680867Z self=, 2025-05-07T20:32:03.1681241Z T=16384, 2025-05-07T20:32:03.1681441Z D=7168, 2025-05-07T20:32:03.1681635Z contiguous=False, 2025-05-07T20:32:03.1681864Z compiled=True, 2025-05-07T20:32:03.1682075Z ) 2025-05-07T20:32:03.1682267Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1682636Z self=, 2025-05-07T20:32:03.1683015Z T=16384, 2025-05-07T20:32:03.1683214Z D=7168, 2025-05-07T20:32:03.1683442Z contiguous=True, 2025-05-07T20:32:03.1683688Z compiled=True, 2025-05-07T20:32:03.1683901Z ) 2025-05-07T20:32:03.1684096Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1684469Z self=, 2025-05-07T20:32:03.1684844Z T=4096, 2025-05-07T20:32:03.1685037Z D=7168, 2025-05-07T20:32:03.1685244Z contiguous=True, 2025-05-07T20:32:03.1685471Z compiled=True, 2025-05-07T20:32:03.1685672Z ) 2025-05-07T20:32:03.1685873Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1686247Z self=, 2025-05-07T20:32:03.1686618Z T=2048, 2025-05-07T20:32:03.1686810Z D=5120, 2025-05-07T20:32:03.1687011Z contiguous=False, 2025-05-07T20:32:03.1687239Z compiled=False, 2025-05-07T20:32:03.1687451Z ) 2025-05-07T20:32:03.1687653Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1688120Z self=, 2025-05-07T20:32:03.1688498Z T=2048, 2025-05-07T20:32:03.1688694Z D=5120, 2025-05-07T20:32:03.1688890Z contiguous=True, 2025-05-07T20:32:03.1689120Z compiled=False, 2025-05-07T20:32:03.1689336Z ) 2025-05-07T20:32:03.1689535Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1689908Z self=, 2025-05-07T20:32:03.1690387Z T=128, 2025-05-07T20:32:03.1690582Z D=7168, 2025-05-07T20:32:03.1690774Z contiguous=False, 2025-05-07T20:32:03.1691010Z compiled=True, 2025-05-07T20:32:03.1691219Z ) 2025-05-07T20:32:03.1691416Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1691786Z self=, 2025-05-07T20:32:03.1692166Z T=16384, 2025-05-07T20:32:03.1692362Z D=5120, 2025-05-07T20:32:03.1692561Z contiguous=True, 2025-05-07T20:32:03.1692784Z compiled=True, 2025-05-07T20:32:03.1692994Z ) 2025-05-07T20:32:03.1693294Z Trying example: 
test_silu_mul( 2025-05-07T20:32:03.1693714Z self=, 2025-05-07T20:32:03.1694085Z T=2048, 2025-05-07T20:32:03.1694276Z D=5120, 2025-05-07T20:32:03.1694474Z contiguous=False, 2025-05-07T20:32:03.1694701Z compiled=True, 2025-05-07T20:32:03.1694906Z ) 2025-05-07T20:32:03.1695108Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1695480Z self=, 2025-05-07T20:32:03.1695858Z T=16384, 2025-05-07T20:32:03.1696064Z D=5120, 2025-05-07T20:32:03.1696270Z contiguous=True, 2025-05-07T20:32:03.1696487Z compiled=False, 2025-05-07T20:32:03.1696696Z ) 2025-05-07T20:32:03.1696902Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1697265Z self=, 2025-05-07T20:32:03.1697644Z T=16384, 2025-05-07T20:32:03.1697842Z D=7168, 2025-05-07T20:32:03.1698042Z contiguous=False, 2025-05-07T20:32:03.1698274Z compiled=False, 2025-05-07T20:32:03.1698487Z ) 2025-05-07T20:32:03.1698683Z Trying example: test_silu_mul( 2025-05-07T20:32:03.1699057Z self=, 2025-05-07T20:32:03.1699433Z T=16384, 2025-05-07T20:32:03.1699625Z D=7168, 2025-05-07T20:32:03.1699826Z contiguous=True, 2025-05-07T20:32:03.1700063Z compiled=False, 2025-05-07T20:32:03.1700264Z ) 2025-05-07T20:32:03.1700469Z PASSED 2025-05-07T20:32:03.2308110Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:03.2309356Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:32:03.2310718Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:03.2312252Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:03.2313217Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:03.2314522Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:03.2316221Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.2317197Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:03.2318409Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:03.2319912Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.2320966Z W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     generator.visit(fn.parse())
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ret = super().visit(node)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]           ^^^^^^^^^^^^^^^^^^^
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     return visitor(node)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]            ^^^^^^^^^^^^^
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     self.visit(item)
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant(
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^
W0507 20:32:03.228000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
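The block above is a warning, not the test failure itself: torch.compile's identify_mutated_tensors tries to lower the user Triton kernel to TTIR so it can prove which arguments the kernel writes to, and when that lowering raises (here, the fp8 dtype error) it falls back to treating every tensor argument as mutated. A self-contained sketch of that conservative branch (mutated_args_fallback is an illustrative name):

    from typing import Any, Dict, List

    import torch

    def mutated_args_fallback(kernel_kwargs: Dict[str, Any]) -> List[str]:
        # When TTIR analysis fails, assume every tensor argument may be
        # written by the kernel: always safe, but it blocks optimizations.
        return [
            name
            for name, value in kernel_kwargs.items()
            if isinstance(value, torch.Tensor)
        ]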
moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a7053bec0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
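Note where the error surfaces: _kernel_quantize_fp8_row is autotuned, and Triton's autotuner compiles and benchmarks every candidate config inside do_bench on the first launch, so a dtype that cannot compile aborts the benchmarking loop itself. A toy autotuned kernel showing the same launch shape (names and config values are illustrative, not FBGEMM's):

    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[triton.Config({"BLOCK": 256}), triton.Config({"BLOCK": 512})],
        key=["N"],
    )
    @triton.jit
    def _scale_kernel(x_ptr, out_ptr, N, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

    # The first launch triggers compile-and-benchmark for every config,
    # which is exactly the stage failing in the traceback above.
    x = torch.randn(4096, device="cuda")
    out = torch.empty_like(x)
    _scale_kernel[lambda meta: (triton.cdiv(4096, meta["BLOCK"]),)](x, out, 4096)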

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a704dce00>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
2025-05-07T20:32:04.6956302Z x1 = x[:, D:] 2025-05-07T20:32:04.6956515Z 2025-05-07T20:32:04.6956698Z if contiguous: 2025-05-07T20:32:04.6956934Z x0 = x0.contiguous() 2025-05-07T20:32:04.6957200Z x1 = x1.contiguous() 2025-05-07T20:32:04.6957449Z 2025-05-07T20:32:04.6957649Z if scale_ub is not None: 2025-05-07T20:32:04.6957933Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.6958267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.6958583Z ) 2025-05-07T20:32:04.6958786Z else: 2025-05-07T20:32:04.6958998Z scale_ub_tensor = None 2025-05-07T20:32:04.6959666Z 2025-05-07T20:32:04.6960002Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6960416Z op = silu_mul_quant 2025-05-07T20:32:04.6960743Z if compiled: 2025-05-07T20:32:04.6961066Z op = torch.compile(op) 2025-05-07T20:32:04.6961416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.6961698Z 2025-05-07T20:32:04.6961907Z y_fp8, y_scale = fn() 2025-05-07T20:32:04.6962195Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:04.6962481Z 2025-05-07T20:32:04.6962723Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.6963056Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:04.6963348Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:04.6963664Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:04.6964090Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6964403Z 2025-05-07T20:32:04.6964619Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:04.6964816Z 2025-05-07T20:32:04.6964927Z moe/activation_test.py:126: 2025-05-07T20:32:04.6965219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6965551Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:04.6965878Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.6966661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:04.6967582Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:04.6968133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.6968809Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.6969492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:04.6970317Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.6971037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:04.6971672Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:04.6972259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:04.6972778Z fn() 2025-05-07T20:32:04.6973401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:04.6974032Z self.fn.run( 2025-05-07T20:32:04.6974492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.6975020Z kernel = self.compile( 2025-05-07T20:32:04.6975564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.6976213Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.6976619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.6976853Z 2025-05-07T20:32:04.6977057Z self = 2025-05-07T20:32:04.6978132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.6979489Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a701ed440>} 2025-05-07T20:32:04.6980803Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.6981824Z context = 2025-05-07T20:32:04.6982121Z 2025-05-07T20:32:04.6982297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.6982819Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.6983282Z module_map=module_map) 2025-05-07T20:32:04.6983666Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.6984037Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:04.6984307Z E ^ 2025-05-07T20:32:04.6984778Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.6985226Z 2025-05-07T20:32:04.6985639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.6986145Z 2025-05-07T20:32:04.6986258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.6986664Z self=, 2025-05-07T20:32:04.6987069Z T=16384, 2025-05-07T20:32:04.6987281Z D=7168, 2025-05-07T20:32:04.6987474Z scale_ub=1200.0, 2025-05-07T20:32:04.6987702Z contiguous=False, 2025-05-07T20:32:04.6987934Z compiled=False, 2025-05-07T20:32:04.6988137Z ) 2025-05-07T20:32:04.8917491Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.8920095Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:04.8922730Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.8924815Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.8925783Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8927087Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.8928449Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.8929442Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8930662Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.8932026Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.8933148Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8934415Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.8935655Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:04.8936861Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.8938060Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:04.8938889Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.8939905Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:04.8940917Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:04.8941698Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:04.8943000Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.8944318Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.8945420Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:04.8946511Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:04.8947669Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.8949015Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.8950059Z 
W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.8950954Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.8951680Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:04.8952689Z W0507 20:32:04.888000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.9515040Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.9516276Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:04.9517594Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.9519004Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.9519971Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.9521258Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.9522614Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.9523583Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.9524795Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.9526159Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.9527520Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.9528786Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.9530143Z W0507 20:32:04.948000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:04.9531343Z W0507 20:32:04.948000 97872 
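Every example above fails at the same point for the same reason: the runner's GPU cannot represent Triton's fp8e4nv type. fp8e4nv is Triton's name for the e4m3 encoding (torch.float8_e4m3fn), and Triton's NVIDIA backend only accepts it on newer compute capabilities; the A10G in a g5.4xlarge is SM 8.6, where only fp8e4b15 and fp8e5 are available. A minimal capability probe, as a sketch (the 8.9 cutoff is our reading of the error and of Triton's Ada/Hopper support, not something the log itself states):

```python
# Sketch: probe whether this GPU can compile Triton kernels that use fp8e4nv
# (torch.float8_e4m3fn). Assumption: the cutoff is compute capability 8.9;
# the A10G on this runner reports (8, 6), which matches the failure above.
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv available:", supports_fp8e4nv())
```

Gating the fp8 test cases on a probe like this (for example via unittest.skipUnless) would turn these hard failures into skips on pre-SM89 runners.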
2025-05-07T20:32:05.9004394Z self = 
2025-05-07T20:32:05.9005129Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:05.9005519Z 
2025-05-07T20:32:05.9005632Z     @given(
2025-05-07T20:32:05.9005959Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.9006304Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.9006610Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.9006946Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.9007276Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.9007896Z     )
2025-05-07T20:32:05.9008260Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.9008703Z     def test_silu_mul_quant(
2025-05-07T20:32:05.9008952Z         self,
2025-05-07T20:32:05.9009145Z         T: int,
2025-05-07T20:32:05.9009350Z         D: int,
2025-05-07T20:32:05.9009573Z         scale_ub: Optional[float],
2025-05-07T20:32:05.9009844Z         contiguous: bool,
2025-05-07T20:32:05.9010243Z         compiled: bool,
2025-05-07T20:32:05.9010475Z     ) -> None:
2025-05-07T20:32:05.9010690Z         torch.manual_seed(2025)
2025-05-07T20:32:05.9010939Z 
2025-05-07T20:32:05.9011217Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.9011559Z 
2025-05-07T20:32:05.9011755Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.9012049Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.9012353Z         x = x_sign * x_clamp
2025-05-07T20:32:05.9012597Z         x0 = x[:, :D]
2025-05-07T20:32:05.9012823Z         x1 = x[:, D:]
2025-05-07T20:32:05.9013128Z 
2025-05-07T20:32:05.9013323Z         if contiguous:
2025-05-07T20:32:05.9013559Z             x0 = x0.contiguous()
2025-05-07T20:32:05.9013818Z             x1 = x1.contiguous()
2025-05-07T20:32:05.9014057Z 
2025-05-07T20:32:05.9014254Z         if scale_ub is not None:
2025-05-07T20:32:05.9014529Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.9014867Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.9015177Z             )
2025-05-07T20:32:05.9015371Z         else:
2025-05-07T20:32:05.9015583Z             scale_ub_tensor = None
2025-05-07T20:32:05.9015838Z 
2025-05-07T20:32:05.9016072Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.9016381Z             op = silu_mul_quant
2025-05-07T20:32:05.9016631Z             if compiled:
2025-05-07T20:32:05.9016881Z                 op = torch.compile(op)
2025-05-07T20:32:05.9017179Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.9017454Z 
2025-05-07T20:32:05.9017652Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:05.9017817Z 
2025-05-07T20:32:05.9017926Z moe/activation_test.py:117: 
2025-05-07T20:32:05.9018218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9018549Z moe/activation_test.py:115: in fn
2025-05-07T20:32:05.9018837Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.9019526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:05.9020214Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:05.9020752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:05.9021432Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.9022088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.9022622Z     kernel = self.compile(
2025-05-07T20:32:05.9023166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.9023809Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.9024242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9024500Z 
2025-05-07T20:32:05.9024708Z self = 
2025-05-07T20:32:05.9025779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.9027232Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a5a149940>}
2025-05-07T20:32:05.9028564Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.9029579Z context = 
2025-05-07T20:32:05.9029864Z 
2025-05-07T20:32:05.9030112Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.9030630Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.9031096Z                            module_map=module_map)
2025-05-07T20:32:05.9031466Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.9031816Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:05.9032076Z E       ^
2025-05-07T20:32:05.9032548Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.9032990Z 
2025-05-07T20:32:05.9033409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.9033911Z 
2025-05-07T20:32:05.9034024Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.9034432Z     self=,
2025-05-07T20:32:05.9034836Z     T=1,
2025-05-07T20:32:05.9035022Z     D=7168,
2025-05-07T20:32:05.9035215Z     scale_ub=None,
2025-05-07T20:32:05.9035435Z     contiguous=True,
2025-05-07T20:32:05.9035664Z     compiled=True,
2025-05-07T20:32:05.9035875Z )
2025-05-07T20:32:05.9036198Z self = 
2025-05-07T20:32:05.9036681Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:05.9036936Z 
2025-05-07T20:32:05.9037013Z     @given(
2025-05-07T20:32:05.9037254Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.9037575Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.9037892Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.9038216Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.9038548Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.9038835Z     )
2025-05-07T20:32:05.9039179Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.9039637Z     def test_silu_mul_quant(
2025-05-07T20:32:05.9039890Z         self,
2025-05-07T20:32:05.9040090Z         T: int,
2025-05-07T20:32:05.9040304Z         D: int,
2025-05-07T20:32:05.9040534Z         scale_ub: Optional[float],
2025-05-07T20:32:05.9040803Z         contiguous: bool,
2025-05-07T20:32:05.9041052Z         compiled: bool,
2025-05-07T20:32:05.9041283Z     ) -> None:
2025-05-07T20:32:05.9041501Z         torch.manual_seed(2025)
2025-05-07T20:32:05.9041772Z 
2025-05-07T20:32:05.9042056Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.9042406Z 
2025-05-07T20:32:05.9042611Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.9042904Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.9043231Z         x = x_sign * x_clamp
2025-05-07T20:32:05.9043481Z         x0 = x[:, :D]
2025-05-07T20:32:05.9043698Z         x1 = x[:, D:]
2025-05-07T20:32:05.9043919Z 
2025-05-07T20:32:05.9044115Z         if contiguous:
2025-05-07T20:32:05.9052342Z             x0 = x0.contiguous()
2025-05-07T20:32:05.9052627Z             x1 = x1.contiguous()
2025-05-07T20:32:05.9052871Z 
2025-05-07T20:32:05.9053159Z         if scale_ub is not None:
2025-05-07T20:32:05.9053443Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.9053782Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.9054105Z             )
2025-05-07T20:32:05.9054331Z         else:
2025-05-07T20:32:05.9054565Z             scale_ub_tensor = None
2025-05-07T20:32:05.9054942Z 
2025-05-07T20:32:05.9055186Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.9055506Z             op = silu_mul_quant
2025-05-07T20:32:05.9055755Z             if compiled:
2025-05-07T20:32:05.9056010Z                 op = torch.compile(op)
2025-05-07T20:32:05.9056315Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.9056588Z 
2025-05-07T20:32:05.9056792Z         y_fp8, y_scale = fn()
2025-05-07T20:32:05.9057165Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:05.9057457Z 
2025-05-07T20:32:05.9057705Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.9058068Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:05.9058365Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:05.9058675Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:05.9059034Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.9059714Z 
2025-05-07T20:32:05.9059977Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:05.9060180Z 
2025-05-07T20:32:05.9060284Z moe/activation_test.py:126: 
2025-05-07T20:32:05.9060585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9060918Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:05.9061246Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.9062044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:05.9062792Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:05.9063327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:05.9064000Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.9064688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:05.9065404Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.9066118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:05.9066751Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:05.9067350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:05.9067866Z     fn()
2025-05-07T20:32:05.9068373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:05.9068952Z     self.fn.run(
2025-05-07T20:32:05.9069428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.9069950Z     kernel = self.compile(
2025-05-07T20:32:05.9070497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.9071146Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.9071538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.9071773Z 
2025-05-07T20:32:05.9071981Z self = 
2025-05-07T20:32:05.9073059Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.9074415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a700c0860>}
2025-05-07T20:32:05.9075922Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.9076933Z context = 
2025-05-07T20:32:05.9077227Z 
2025-05-07T20:32:05.9077398Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.9077924Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.9078517Z                            module_map=module_map)
2025-05-07T20:32:05.9078887Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.9079257Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:05.9079530Z E       ^
2025-05-07T20:32:05.9079991Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.9080440Z 
2025-05-07T20:32:05.9080855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.9081367Z 
2025-05-07T20:32:05.9081472Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.9081888Z     self=,
2025-05-07T20:32:05.9082282Z     T=4096,
2025-05-07T20:32:05.9082481Z     D=5120,
2025-05-07T20:32:05.9082677Z     scale_ub=None,
2025-05-07T20:32:05.9082892Z     contiguous=False,
2025-05-07T20:32:05.9083143Z     compiled=False,
2025-05-07T20:32:05.9083357Z )
2025-05-07T20:32:06.2091512Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:06.2093793Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last):
2025-05-07T20:32:06.2095279Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:06.2096767Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:06.2097728Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2099032Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:06.2100404Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:06.2101374Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2102586Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:06.2103942Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:06.2104997Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2106617Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:06.2107858Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     generator.visit(fn.parse())
2025-05-07T20:32:06.2109062Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:06.2110408Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ret = super().visit(node)
2025-05-07T20:32:06.2111219Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:06.2112237Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:06.2113241Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     return visitor(node)
2025-05-07T20:32:06.2114027Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]            ^^^^^^^^^^^^^
2025-05-07T20:32:06.2115267Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:06.2116531Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:06.2117633Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:06.2118658Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     self.visit(item)
2025-05-07T20:32:06.2119819Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:06.2121152Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:06.2122200Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:06.2123105Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:06.2123841Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^
2025-05-07T20:32:06.2124888Z W0507 20:32:06.205000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.4180316Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:06.4181127Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.4182136Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:06.4183150Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:06.4183939Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:06.4185128Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.4186402Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.4187501Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.4188615Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:06.4189783Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.4191110Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.4192231Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.4193128Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.4193862Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:06.4195083Z W0507 20:32:06.412000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.7135000Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.7136088Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:06.7137399Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.7138832Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.7139799Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7141083Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.7142447Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.7143412Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7144640Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.7145989Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.7147032Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7148288Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.7149838Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:06.7151052Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.7152237Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:06.7153189Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.7154201Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:06.7155209Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:06.7155999Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:06.7157192Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.7158443Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.7159798Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:06.7160826Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:06.7161998Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.7163327Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.7164376Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.7165285Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.7166011Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:06.7167014Z W0507 20:32:06.710000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.0897315Z self = 
2025-05-07T20:32:08.0898120Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:08.0898409Z 
2025-05-07T20:32:08.0898493Z     @given(
2025-05-07T20:32:08.0898746Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:08.0899070Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:08.0899384Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:08.0899725Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:08.0900170Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:08.0900576Z     )
2025-05-07T20:32:08.0900982Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:08.0901437Z     def test_silu_mul_quant(
2025-05-07T20:32:08.0901773Z         self,
2025-05-07T20:32:08.0902049Z         T: int,
2025-05-07T20:32:08.0902334Z         D: int,
2025-05-07T20:32:08.0902640Z         scale_ub: Optional[float],
2025-05-07T20:32:08.0903017Z         contiguous: bool,
2025-05-07T20:32:08.0903272Z         compiled: bool,
2025-05-07T20:32:08.0903509Z     ) -> None:
2025-05-07T20:32:08.0903726Z         torch.manual_seed(2025)
2025-05-07T20:32:08.0904035Z 
2025-05-07T20:32:08.0904421Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:08.0904888Z 
2025-05-07T20:32:08.0905152Z         x_sign = torch.sign(x)
2025-05-07T20:32:08.0905493Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:08.0905819Z         x = x_sign * x_clamp
2025-05-07T20:32:08.0906069Z         x0 = x[:, :D]
2025-05-07T20:32:08.0906301Z         x1 = x[:, D:]
2025-05-07T20:32:08.0906522Z 
2025-05-07T20:32:08.0906719Z         if contiguous:
2025-05-07T20:32:08.0906975Z             x0 = x0.contiguous()
2025-05-07T20:32:08.0907249Z             x1 = x1.contiguous()
2025-05-07T20:32:08.0907502Z 
2025-05-07T20:32:08.0907716Z         if scale_ub is not None:
2025-05-07T20:32:08.0908011Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:08.0908346Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:08.0908670Z             )
2025-05-07T20:32:08.0908882Z         else:
2025-05-07T20:32:08.0909106Z             scale_ub_tensor = None
2025-05-07T20:32:08.0909381Z 
2025-05-07T20:32:08.0909629Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.0909960Z             op = silu_mul_quant
2025-05-07T20:32:08.0910228Z             if compiled:
2025-05-07T20:32:08.0910496Z                 op = torch.compile(op)
2025-05-07T20:32:08.0910810Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.0911090Z 
2025-05-07T20:32:08.0911295Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:08.0911466Z 
2025-05-07T20:32:08.0911585Z moe/activation_test.py:117: 
2025-05-07T20:32:08.0911888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.0912235Z moe/activation_test.py:115: in fn
2025-05-07T20:32:08.0912877Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.0913579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:08.0914279Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:08.0914826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.0915715Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.0916379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.0916925Z     kernel = self.compile(
2025-05-07T20:32:08.0917486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.0918146Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.0918555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.0918794Z 
2025-05-07T20:32:08.0919008Z self = 
2025-05-07T20:32:08.0920095Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.0921492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a596c40e0>}
2025-05-07T20:32:08.0922821Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.0923849Z context = 
2025-05-07T20:32:08.0924151Z 
2025-05-07T20:32:08.0924325Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.0924855Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.0925353Z                            module_map=module_map)
2025-05-07T20:32:08.0925762Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.0926129Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.0926410Z E       ^
2025-05-07T20:32:08.0926887Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.0927347Z 
2025-05-07T20:32:08.0927764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.0928273Z 
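The failing test body above fixes the contract being checked: silu_mul_quant(x0, x1, scale_ub) should return an FP8 tensor plus per-row scales such that y_fp8.to(torch.float32) * y_scale[:, None] recovers silu(x0) * x1. A rough eager-mode sketch of that contract, assuming torch.float8_e4m3fn is available; this is not the fbgemm_gpu kernel, just the math it fuses, and treating scale_ub as a cap on the per-row max is an assumption here:

import torch

def silu_mul_quant_sketch(x0, x1, scale_ub=None):
    # Fused op: y = silu(x0) * x1 in fp32, then row-wise FP8 (E4M3) quantization.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed upper-bound semantics
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
    y_scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale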
2025-05-07T20:32:08.0928392Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.0928817Z     self=,
2025-05-07T20:32:08.0929232Z     T=4096,
2025-05-07T20:32:08.0929439Z     D=7168,
2025-05-07T20:32:08.0929642Z     scale_ub=None,
2025-05-07T20:32:08.0929878Z     contiguous=False,
2025-05-07T20:32:08.0930120Z     compiled=False,
2025-05-07T20:32:08.0930354Z )
2025-05-07T20:32:08.0961020Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.0961435Z     self=,
2025-05-07T20:32:08.0961832Z     T=128,
2025-05-07T20:32:08.0962030Z     D=7168,
2025-05-07T20:32:08.0962243Z     scale_ub=None,
2025-05-07T20:32:08.0962459Z     contiguous=False,
2025-05-07T20:32:08.0962692Z     compiled=True,
2025-05-07T20:32:08.0962902Z )
2025-05-07T20:32:08.1529618Z self = 
2025-05-07T20:32:08.1530339Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:08.1530697Z 
2025-05-07T20:32:08.1530813Z     @given(
2025-05-07T20:32:08.1531108Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:08.1531456Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:08.1531774Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:08.1532115Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:08.1532443Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:08.1532731Z     )
2025-05-07T20:32:08.1542684Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:08.1543205Z     def test_silu_mul_quant(
2025-05-07T20:32:08.1543457Z         self,
2025-05-07T20:32:08.1543652Z         T: int,
2025-05-07T20:32:08.1543853Z         D: int,
2025-05-07T20:32:08.1544074Z         scale_ub: Optional[float],
2025-05-07T20:32:08.1544338Z         contiguous: bool,
2025-05-07T20:32:08.1544583Z         compiled: bool,
2025-05-07T20:32:08.1544809Z     ) -> None:
2025-05-07T20:32:08.1545022Z         torch.manual_seed(2025)
2025-05-07T20:32:08.1545314Z 
2025-05-07T20:32:08.1545604Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:08.1545943Z 
2025-05-07T20:32:08.1546139Z         x_sign = torch.sign(x)
2025-05-07T20:32:08.1546432Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:08.1546732Z         x = x_sign * x_clamp
2025-05-07T20:32:08.1546980Z         x0 = x[:, :D]
2025-05-07T20:32:08.1547202Z         x1 = x[:, D:]
2025-05-07T20:32:08.1547409Z 
2025-05-07T20:32:08.1547598Z         if contiguous:
2025-05-07T20:32:08.1547840Z             x0 = x0.contiguous()
2025-05-07T20:32:08.1548092Z             x1 = x1.contiguous()
2025-05-07T20:32:08.1548333Z 
2025-05-07T20:32:08.1548528Z         if scale_ub is not None:
2025-05-07T20:32:08.1548801Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:08.1549128Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:08.1549438Z             )
2025-05-07T20:32:08.1549634Z         else:
2025-05-07T20:32:08.1549844Z             scale_ub_tensor = None
2025-05-07T20:32:08.1550107Z 
2025-05-07T20:32:08.1550642Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.1550958Z             op = silu_mul_quant
2025-05-07T20:32:08.1551214Z             if compiled:
2025-05-07T20:32:08.1551466Z                 op = torch.compile(op)
2025-05-07T20:32:08.1551755Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:08.1552033Z 
2025-05-07T20:32:08.1552234Z         y_fp8, y_scale = fn()
2025-05-07T20:32:08.1552511Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:08.1552952Z 
2025-05-07T20:32:08.1553188Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:08.1553521Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:08.1553808Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:08.1554119Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:08.1554470Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.1554775Z 
2025-05-07T20:32:08.1554986Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:08.1555182Z 
2025-05-07T20:32:08.1555314Z moe/activation_test.py:126: 
2025-05-07T20:32:08.1555629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.1555967Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:08.1556296Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:08.1557077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:08.1557827Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:08.1558374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:08.1559061Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:08.1560004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:08.1560728Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:08.1561449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:08.1562080Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:08.1562668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:08.1563196Z     fn()
2025-05-07T20:32:08.1563706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:08.1564285Z     self.fn.run(
2025-05-07T20:32:08.1564747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:08.1565278Z     kernel = self.compile(
2025-05-07T20:32:08.1565825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:08.1566469Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.1566875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:08.1567113Z 
2025-05-07T20:32:08.1567318Z self = 
2025-05-07T20:32:08.1568387Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:08.1569755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a5a226340>}
2025-05-07T20:32:08.1571200Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:08.1572212Z context = 
2025-05-07T20:32:08.1572497Z 
2025-05-07T20:32:08.1572674Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:08.1573280Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.1573862Z                            module_map=module_map)
2025-05-07T20:32:08.1574228Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.1574588Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:08.1574851Z E       ^
2025-05-07T20:32:08.1575310Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:08.1575752Z 
2025-05-07T20:32:08.1576168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:08.1576677Z 
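In the example above the reported failure comes from the eager reference path (ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, hit during autotuning) rather than from silu_mul_quant itself: both sides of the comparison need E4M3 support, so every drawn example on this runner dies the same way. One conventional way to express that hardware requirement, sketched with pytest (this marker is illustrative and does not appear in moe/activation_test.py):

import pytest
import torch

requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv (E4M3) kernels require SM 8.9+ (Ada/Hopper)",
)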
2025-05-07T20:32:08.1576790Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.1577193Z     self=,
2025-05-07T20:32:08.1577589Z     T=128,
2025-05-07T20:32:08.1577784Z     D=7168,
2025-05-07T20:32:08.1577978Z     scale_ub=None,
2025-05-07T20:32:08.1578202Z     contiguous=False,
2025-05-07T20:32:08.1578432Z     compiled=False,
2025-05-07T20:32:08.1578643Z )
2025-05-07T20:32:08.3573549Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.3573968Z     self=,
2025-05-07T20:32:08.3574378Z     T=4096,
2025-05-07T20:32:08.3574573Z     D=5120,
2025-05-07T20:32:08.3574767Z     scale_ub=1200.0,
2025-05-07T20:32:08.3575005Z     contiguous=True,
2025-05-07T20:32:08.3575259Z     compiled=False,
2025-05-07T20:32:08.3575490Z )
2025-05-07T20:32:08.3604666Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:08.3605084Z     self=,
2025-05-07T20:32:08.3605513Z     T=1,
2025-05-07T20:32:08.3605725Z     D=5120,
2025-05-07T20:32:08.3605917Z     scale_ub=None,
2025-05-07T20:32:08.3606148Z     contiguous=True,
2025-05-07T20:32:08.3606373Z     compiled=True,
2025-05-07T20:32:08.3606573Z )
2025-05-07T20:32:08.6077083Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:08.6078284Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last):
2025-05-07T20:32:08.6079646Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:08.6081066Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:08.6082055Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6083351Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:08.6084724Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:08.6085701Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6087205Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:08.6088567Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:08.6089629Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6091029Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:08.6092260Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     generator.visit(fn.parse())
2025-05-07T20:32:08.6093561Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:08.6094754Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ret = super().visit(node)
2025-05-07T20:32:08.6095563Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:08.6096573Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:08.6097585Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     return visitor(node)
2025-05-07T20:32:08.6098368Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]            ^^^^^^^^^^^^^
2025-05-07T20:32:08.6099556Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:08.6100824Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:08.6101928Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:08.6102948Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     self.visit(item)
2025-05-07T20:32:08.6104118Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:08.6105510Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:08.6106552Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:08.6107455Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:08.6108184Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^
2025-05-07T20:32:08.6109188Z W0507 20:32:08.604000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1219444Z self = 2025-05-07T20:32:09.1220071Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.1220339Z 2025-05-07T20:32:09.1220426Z @given( 2025-05-07T20:32:09.1220684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.1221014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.1221358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.1221699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.1222046Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.1222346Z ) 2025-05-07T20:32:09.1222702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.1223157Z def test_silu_mul_quant( 2025-05-07T20:32:09.1223424Z self, 2025-05-07T20:32:09.1223626Z T: int, 2025-05-07T20:32:09.1223843Z D: int, 2025-05-07T20:32:09.1224080Z scale_ub: Optional[float], 2025-05-07T20:32:09.1224361Z contiguous: bool, 2025-05-07T20:32:09.1224628Z compiled: bool, 2025-05-07T20:32:09.1224875Z ) -> None: 2025-05-07T20:32:09.1225108Z torch.manual_seed(2025) 2025-05-07T20:32:09.1225367Z 2025-05-07T20:32:09.1225656Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.1226012Z 2025-05-07T20:32:09.1226239Z x_sign = torch.sign(x) 2025-05-07T20:32:09.1226543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.1226873Z x = x_sign * x_clamp 2025-05-07T20:32:09.1227127Z x0 = x[:, :D] 2025-05-07T20:32:09.1227348Z x1 = x[:, D:] 2025-05-07T20:32:09.1227573Z 2025-05-07T20:32:09.1227774Z if contiguous: 2025-05-07T20:32:09.1228013Z x0 = x0.contiguous() 2025-05-07T20:32:09.1228293Z x1 = x1.contiguous() 2025-05-07T20:32:09.1228554Z 2025-05-07T20:32:09.1228757Z if scale_ub is not None: 2025-05-07T20:32:09.1229038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.1229391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.1229709Z ) 2025-05-07T20:32:09.1229921Z else: 2025-05-07T20:32:09.1230148Z scale_ub_tensor = None 2025-05-07T20:32:09.1230417Z 2025-05-07T20:32:09.1230665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1231311Z op = silu_mul_quant 2025-05-07T20:32:09.1231574Z if compiled: 2025-05-07T20:32:09.1231842Z op = torch.compile(op) 2025-05-07T20:32:09.1232151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.1232438Z 2025-05-07T20:32:09.1232638Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.1232935Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.1233392Z 2025-05-07T20:32:09.1233642Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.1233995Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.1234305Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.1234626Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.1235000Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.1235321Z 2025-05-07T20:32:09.1235534Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.1235739Z 2025-05-07T20:32:09.1235854Z moe/activation_test.py:126: 2025-05-07T20:32:09.1236166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1236518Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.1236851Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.1237653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:32:09.1238421Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.1238969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.1239665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.1240360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.1241094Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.1241818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.1242465Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.1243077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.1243608Z fn() 2025-05-07T20:32:09.1244124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.1244714Z self.fn.run( 2025-05-07T20:32:09.1245197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.1245727Z kernel = self.compile( 2025-05-07T20:32:09.1246278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.1246940Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.1247346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.1247576Z 2025-05-07T20:32:09.1247786Z self = 2025-05-07T20:32:09.1248865Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.1250250Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58feb420>} 2025-05-07T20:32:09.1251689Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.1252707Z context = 2025-05-07T20:32:09.1253119Z 2025-05-07T20:32:09.1253294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.1253824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.1254303Z module_map=module_map) 2025-05-07T20:32:09.1254757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.1255135Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.1255423Z E ^ 2025-05-07T20:32:09.1255936Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.1256394Z 2025-05-07T20:32:09.1256810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.1257326Z 2025-05-07T20:32:09.1257440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.1257863Z self=, 2025-05-07T20:32:09.1258266Z T=2048, 2025-05-07T20:32:09.1258474Z D=5120, 2025-05-07T20:32:09.1258682Z scale_ub=None, 2025-05-07T20:32:09.1258902Z contiguous=True, 2025-05-07T20:32:09.1259139Z compiled=True, 2025-05-07T20:32:09.1259657Z ) 2025-05-07T20:32:09.3665975Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:09.3667089Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:09.3668428Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:09.3669840Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:09.3670808Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3672107Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:09.3673461Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3674449Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3675672Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:09.3677030Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3678091Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3679694Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:09.3680921Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:09.3682150Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:09.3683491Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:09.3684316Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:09.3685323Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:09.3686336Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 2025-05-07T20:32:09.3687123Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^^^^^^^^^^^^^ 2025-05-07T20:32:09.3688322Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:09.3689592Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:09.3690683Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:09.3691715Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:09.3692880Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:09.3694322Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:09.3695374Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3696321Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3697059Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:09.3698059Z W0507 20:32:09.363000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8666109Z self = 2025-05-07T20:32:09.8666813Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.8667084Z 2025-05-07T20:32:09.8667166Z @given( 2025-05-07T20:32:09.8667416Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.8667737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.8668053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.8668379Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.8668716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.8669021Z ) 2025-05-07T20:32:09.8669371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.8669825Z def test_silu_mul_quant( 2025-05-07T20:32:09.8670073Z self, 2025-05-07T20:32:09.8670267Z T: int, 2025-05-07T20:32:09.8670470Z D: int, 2025-05-07T20:32:09.8670700Z scale_ub: Optional[float], 2025-05-07T20:32:09.8670969Z contiguous: bool, 2025-05-07T20:32:09.8671215Z compiled: bool, 2025-05-07T20:32:09.8671453Z ) -> None: 2025-05-07T20:32:09.8671668Z torch.manual_seed(2025) 2025-05-07T20:32:09.8671915Z 2025-05-07T20:32:09.8672188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.8672527Z 2025-05-07T20:32:09.8672728Z x_sign = torch.sign(x) 2025-05-07T20:32:09.8673021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.8673331Z x = x_sign * x_clamp 2025-05-07T20:32:09.8673582Z x0 = x[:, :D] 2025-05-07T20:32:09.8673808Z x1 = x[:, D:] 2025-05-07T20:32:09.8674027Z 2025-05-07T20:32:09.8674218Z if contiguous: 2025-05-07T20:32:09.8674458Z x0 = x0.contiguous() 2025-05-07T20:32:09.8674717Z x1 = x1.contiguous() 2025-05-07T20:32:09.8674952Z 2025-05-07T20:32:09.8675147Z if scale_ub is not None: 2025-05-07T20:32:09.8675428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.8675772Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.8676133Z ) 2025-05-07T20:32:09.8676337Z else: 2025-05-07T20:32:09.8676548Z scale_ub_tensor = None 2025-05-07T20:32:09.8676810Z 2025-05-07T20:32:09.8677052Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8677359Z op = silu_mul_quant 2025-05-07T20:32:09.8677612Z if compiled: 2025-05-07T20:32:09.8677871Z op = torch.compile(op) 2025-05-07T20:32:09.8678182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.8678458Z 2025-05-07T20:32:09.8678662Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.8678967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.8679261Z 2025-05-07T20:32:09.8679509Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.8679852Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.8680145Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.8680824Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.8681188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.8681493Z 2025-05-07T20:32:09.8681710Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.8681911Z 2025-05-07T20:32:09.8682017Z moe/activation_test.py:126: 2025-05-07T20:32:09.8682321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8682797Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.8683127Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.8683914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:09.8684656Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.8685200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.8685937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.8686620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.8687328Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.8688050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.8688690Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.8689290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.8689796Z fn() 2025-05-07T20:32:09.8690301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.8690878Z self.fn.run( 2025-05-07T20:32:09.8691344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.8691873Z kernel = self.compile( 2025-05-07T20:32:09.8692412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.8693161Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.8693557Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.8693794Z 2025-05-07T20:32:09.8694002Z self = 2025-05-07T20:32:09.8695071Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.8696448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58561800>} 2025-05-07T20:32:09.8697770Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.8698789Z context = 2025-05-07T20:32:09.8699078Z 2025-05-07T20:32:09.8699251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.8699773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.8700236Z module_map=module_map) 2025-05-07T20:32:09.8700610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.8700975Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.8701244Z E ^ 2025-05-07T20:32:09.8701809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.8702265Z 2025-05-07T20:32:09.8702678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.8703181Z 2025-05-07T20:32:09.8703293Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.8703703Z self=, 2025-05-07T20:32:09.8704196Z T=128, 2025-05-07T20:32:09.8704395Z D=5120, 2025-05-07T20:32:09.8704588Z scale_ub=None, 2025-05-07T20:32:09.8704808Z contiguous=True, 2025-05-07T20:32:09.8705040Z compiled=True, 2025-05-07T20:32:09.8705253Z ) 2025-05-07T20:32:10.1189819Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.1191753Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:10.1194080Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.1196471Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.1198221Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1200544Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.1203012Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.1204762Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1206934Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.1209363Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.1211094Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1213464Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:10.1215634Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:10.1217743Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.1219790Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:10.1221569Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.1223316Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:10.1225036Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:10.1226653Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^^^^^^^^^^^^^ 2025-05-07T20:32:10.1228774Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.1230949Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.1232932Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.1234736Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:10.1236806Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.1239203Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.1241008Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.1242593Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.1243852Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:10.1245652Z W0507 20:32:10.115000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6681183Z self = 2025-05-07T20:32:10.6682051Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:10.6682497Z 2025-05-07T20:32:10.6682620Z @given( 2025-05-07T20:32:10.6682985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6683911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6684408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6684943Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6685461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6685936Z ) 2025-05-07T20:32:10.6686501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6687205Z def test_silu_mul_quant( 2025-05-07T20:32:10.6687558Z self, 2025-05-07T20:32:10.6687869Z T: int, 2025-05-07T20:32:10.6688172Z D: int, 2025-05-07T20:32:10.6688515Z scale_ub: Optional[float], 2025-05-07T20:32:10.6688948Z contiguous: bool, 2025-05-07T20:32:10.6689327Z compiled: bool, 2025-05-07T20:32:10.6689687Z ) -> None: 2025-05-07T20:32:10.6690022Z torch.manual_seed(2025) 2025-05-07T20:32:10.6690410Z 2025-05-07T20:32:10.6690828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6691378Z 2025-05-07T20:32:10.6691700Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6692174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6692684Z x = x_sign * x_clamp 2025-05-07T20:32:10.6693203Z x0 = x[:, :D] 2025-05-07T20:32:10.6693533Z x1 = x[:, D:] 2025-05-07T20:32:10.6693854Z 2025-05-07T20:32:10.6694152Z if contiguous: 2025-05-07T20:32:10.6694519Z x0 = x0.contiguous() 2025-05-07T20:32:10.6694948Z x1 = x1.contiguous() 2025-05-07T20:32:10.6695348Z 2025-05-07T20:32:10.6695640Z if scale_ub is not None: 2025-05-07T20:32:10.6696080Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6696626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6697120Z ) 2025-05-07T20:32:10.6697409Z else: 2025-05-07T20:32:10.6697732Z scale_ub_tensor = None 2025-05-07T20:32:10.6698134Z 2025-05-07T20:32:10.6698492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6698994Z op = silu_mul_quant 2025-05-07T20:32:10.6699390Z if compiled: 2025-05-07T20:32:10.6699770Z op = torch.compile(op) 2025-05-07T20:32:10.6700240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6700679Z 2025-05-07T20:32:10.6700974Z y_fp8, y_scale = fn() 2025-05-07T20:32:10.6701426Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:10.6701890Z 2025-05-07T20:32:10.6702266Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6702790Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:10.6703264Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:10.6703766Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:10.6704359Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:10.6704885Z 2025-05-07T20:32:10.6705212Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:10.6705506Z 2025-05-07T20:32:10.6705658Z moe/activation_test.py:126: 2025-05-07T20:32:10.6706160Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6706636Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:10.6707095Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:10.6708429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:10.6709485Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:10.6710236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.6711158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.6712117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:10.6713218Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:10.6714224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:10.6715104Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:10.6715929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:10.6716642Z fn() 2025-05-07T20:32:10.6717353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:10.6718151Z self.fn.run( 2025-05-07T20:32:10.6718786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.6719520Z kernel = self.compile( 2025-05-07T20:32:10.6720258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.6721163Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.6721700Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6722010Z 2025-05-07T20:32:10.6722286Z self = 2025-05-07T20:32:10.6723780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.6725736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58bad620>} 2025-05-07T20:32:10.6727866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.6729538Z context = 2025-05-07T20:32:10.6729968Z 2025-05-07T20:32:10.6730223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.6730952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.6731608Z module_map=module_map) 2025-05-07T20:32:10.6732121Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.6732612Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:10.6732974Z E ^ 2025-05-07T20:32:10.6733773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:10.6735073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
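[Editor's note] The failure above is environmental, not a bug in the test logic: Triton's fp8e4nv type maps to float8 e4m3fn, which the NVIDIA backend only accepts on GPUs of compute capability 8.9+ (Ada/Hopper), while this job runs on an A10G (linux.g5.4xlarge, sm_86) that exposes only fp8e4b15 and fp8e5, exactly as the ValueError says. The repeated identify_mutated_tensors warnings are a side effect: when TTIR generation raises, torch.compile's mutation analysis gives up and conservatively assumes every kernel input is mutated. A minimal capability guard along the following lines would skip these cases on unsupported hardware; this is a sketch, and the helper name is illustrative, not FBGEMM API:

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (float8 e4m3fn) needs compute capability >= 8.9;
        # the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class SiluMulQuantTest(unittest.TestCase):
        ...

With such a guard the job would report these cases as skipped instead of walking every Hypothesis example into the same guaranteed CompilationError.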
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:10.9291799Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:10.9292621Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9293749Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:10.9294763Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:10.9295546Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:10.9296908Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:10.9298180Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:10.9299294Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:10.9300399Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:10.9301559Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:10.9302915Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:10.9303972Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.9304880Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.9305618Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:10.9306636Z W0507 20:32:10.924000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9985482Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:10.9986565Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:10.9987897Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:10.9989324Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:10.9990293Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9991591Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:10.9992960Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.9993936Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9995159Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:10.9996519Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.9997926Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:10.9999202Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.0000719Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:11.0001926Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.0003125Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:11.0003948Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.0004968Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:11.0005977Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:11.0006780Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:11.0007982Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.0009258Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.0010374Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:11.0011405Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:11.0012568Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.0014007Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.0015060Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.0015961Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.0016701Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:11.0017708Z W0507 20:32:10.995000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.2071648Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:11.2073252Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:11.2074586Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:11.2076019Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:11.2077141Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2078432Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:11.2079815Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2080793Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2082018Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:11.2083396Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2084448Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2085719Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.2086956Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:11.2088178Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.2089380Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:11.2090205Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2091404Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:11.2092425Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:11.2093301Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:11.2094499Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.2095759Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.2096951Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:11.2097987Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:11.2099154Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.2100582Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.2101628Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2102535Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2103274Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:11.2104276Z W0507 20:32:11.204000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.2168601Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:11.2169781Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:11.2171109Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:11.2172510Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:11.2173550Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2174847Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:11.2176219Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.2177193Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2178407Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:11.2179771Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.2180819Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2182260Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:11.2183492Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:11.2184691Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:11.2185993Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:11.2186867Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:11.2187879Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:11.2188880Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:11.2189658Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:11.2190849Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:11.2192099Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:11.2193207Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:11.2194241Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:11.2195400Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:11.2196789Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:11.2197826Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.2198721Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.2199453Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:11.2200464Z W0507 20:32:11.214000 97872 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.4725881Z self =
2025-05-07T20:32:11.4726518Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the first report elided; the reference path again fails compiling _kernel_quantize_fp8_row via moe/activation_test.py:126 -> ref_fn (:124) -> triton_quantize_fp8_row (fp8_gemm.py:2370), with the same triton jit/autotune/compile frames ...]
2025-05-07T20:32:11.4768928Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.4769296Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.4769573Z E       ^
2025-05-07T20:32:11.4770044Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.4770912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
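[Editor's note] For readers following the reference path: triton_quantize_fp8_row performs dynamic row-wise quantization, producing one scale per row, which is why the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that idea, under the assumption that scale_ub clamps the per-row max (an illustration of the scheme, not FBGEMM's exact kernel semantics):

    from typing import Optional

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # One scale per row: scale = row_max_abs / FP8_MAX, optionally clamped.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / fp8_max).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Note that the final cast is the step Triton cannot code-generate here: PyTorch can hold and convert float8_e4m3fn tensors on sm_86 in eager mode, but the Triton NVIDIA backend refuses the fp8e4nv conversion for this architecture.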
2025-05-07T20:32:11.4771535Z Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:11.5050897Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:11.5053442Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:11.5056094Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:11.5057232Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:11.5058655Z W0507 20:32:11.503000 97872 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
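[Editor's note] This recompile-limit warning is a consequence of the parameter sweep rather than a defect: x0 = x[:, :D] has stride(0) == 2*D (here 10240) as a view of x but stride(0) == D (5120) after .contiguous(), so each new (T, D, contiguity) combination fails a dynamo guard until the eighth recompile, after which dynamo falls back to eager for silu_mul_quant. If the sweep were meant to stay compiled, either of the following would help (a sketch; the import path is inferred from the traceback, and config defaults vary across torch releases):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Compile once with dynamic shapes so new sizes/strides add fewer guards:
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)

    # ...or simply raise the limit named in the warning for an exhaustive sweep:
    torch._dynamo.config.recompile_limit = 64

Setting TORCH_LOGS="recompiles", as the warning itself suggests, prints every guard failure and is the quickest way to confirm which inputs are being specialized on.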
2025-05-07T20:32:11.5935262Z self =
2025-05-07T20:32:11.5935796Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source identical to the first report elided; the reference path again fails compiling _kernel_quantize_fp8_row ...]
2025-05-07T20:32:11.5971121Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.5971495Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.5971776Z E       ^
2025-05-07T20:32:11.5972242Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.5973192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:11.5973814Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:11.7366637Z self =
2025-05-07T20:32:11.7367385Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[... test source identical to the first report elided; unlike the earlier examples, this one fails already at `> y_fp8, y_scale = fn()` (moe/activation_test.py:117), i.e. while compiling the forward kernel itself ...]
2025-05-07T20:32:11.7380930Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.7381215Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.7381769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:11.7382330Z     return fn(*args, **kwargs)
2025-05-07T20:32:11.7382991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.7383688Z     _fbgemm_silu_mul_quant[grid](
[... triton jit/compile frames elided ...]
2025-05-07T20:32:11.7394985Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.7395348Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.7395619Z E       ^
2025-05-07T20:32:11.7396091Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:11.7396961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:11.7397575Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
2025-05-07T20:32:11.8011606Z self =
2025-05-07T20:32:11.8012320Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical to the first report elided; the reference path again fails compiling _kernel_quantize_fp8_row via ref_fn -> triton_quantize_fp8_row ...]
2025-05-07T20:32:11.8046331Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.8046744Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:11.8047014Z E       ^
2025-05-07T20:32:11.8047480Z E       ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:11.8047926Z 
2025-05-07T20:32:11.8048344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.8048848Z 
2025-05-07T20:32:11.8048967Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.8049383Z     self=,
2025-05-07T20:32:11.8049787Z     T=1,
2025-05-07T20:32:11.8049976Z     D=5120,
2025-05-07T20:32:11.8050175Z     scale_ub=None,
2025-05-07T20:32:11.8050404Z     contiguous=True,
2025-05-07T20:32:11.8050638Z     compiled=False,
2025-05-07T20:32:11.8050852Z )
2025-05-07T20:32:11.9549368Z self = 
2025-05-07T20:32:11.9550056Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[test source identical to the example above, this time failing earlier, at the fn() call]
2025-05-07T20:32:11.9571360Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:11.9571523Z 
2025-05-07T20:32:11.9571631Z moe/activation_test.py:117: 
2025-05-07T20:32:11.9571922Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9572261Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.9572552Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.9573326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.9574011Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:11.9574548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:11.9575223Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:11.9575883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:11.9576413Z     kernel = self.compile(
2025-05-07T20:32:11.9576960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:11.9577599Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:11.9578007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:11.9578246Z 
2025-05-07T20:32:11.9578451Z self = 
2025-05-07T20:32:11.9579521Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:11.9580969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43a7e700>}
2025-05-07T20:32:11.9582291Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:11.9583294Z context = 
2025-05-07T20:32:11.9583589Z 
2025-05-07T20:32:11.9583755Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:11.9584295Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:11.9584769Z                            module_map=module_map)
2025-05-07T20:32:11.9585143Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9585501Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.9585764Z E       ^
2025-05-07T20:32:11.9586434Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
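For reference, ref_fn in the listing above computes SiLU(x0) * x1 in fp32 and then quantizes row-wise via triton_quantize_fp8_row. A rough pure-PyTorch sketch of that rowwise quantization step, useful for reasoning about the test on hardware where the Triton kernel cannot compile; the function name, the use of torch.float8_e4m3fn, and the epsilon are assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max sets the scale so each row fills the fp8 range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            # Cap outlier rows, mirroring the scale_ub_tensor argument above.
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / fp8_max).clamp(min=1e-12)  # avoid divide-by-zero
        y_fp8 = (y.float() / scale).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale.squeeze(-1)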
2025-05-07T20:32:11.9586878Z 
2025-05-07T20:32:11.9587288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9587797Z 
2025-05-07T20:32:11.9587905Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9588319Z     self=,
2025-05-07T20:32:11.9588843Z     T=128,
2025-05-07T20:32:11.9589028Z     D=5120,
2025-05-07T20:32:11.9589224Z     scale_ub=None,
2025-05-07T20:32:11.9589448Z     contiguous=False,
2025-05-07T20:32:11.9589673Z     compiled=True,
2025-05-07T20:32:11.9589893Z )
[test source identical to the examples above, again failing at the fn() call at moe/activation_test.py:117; with compiled=True the traceback passes through torch._dynamo before reaching the same Triton compile path]
2025-05-07T20:32:11.9603197Z moe/activation_test.py:115: in fn
2025-05-07T20:32:11.9603477Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:11.9604030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:11.9604669Z     return fn(*args, **kwargs)
2025-05-07T20:32:11.9605312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:11.9605989Z     _fbgemm_silu_mul_quant[grid](
[remaining Triton jit/compile frames identical to the example above]
2025-05-07T20:32:11.9617116Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:11.9617477Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:11.9617730Z E       ^
2025-05-07T20:32:11.9618193Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
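With compiled=True the only new frame is torch/_dynamo/eval_frame.py: torch.compile wraps silu_mul_quant but still reaches the same _fbgemm_silu_mul_quant launch, so both modes die at the same point. The failure itself needs none of the FBGEMM machinery; a hypothetical, self-contained Triton kernel that converts to tl.float8e4nv should reproduce the identical CompilationError on a pre-sm_89 GPU (a sketch under that assumption, not code from this repository):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 this conversion is what triggers
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(16, device="cuda")
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError on A10G-class GPUs.
    _cast_fp8e4nv_kernel[(1,)](x, y, 16, BLOCK=16)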
2025-05-07T20:32:11.9618640Z 
2025-05-07T20:32:11.9619050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:11.9619548Z 
2025-05-07T20:32:11.9619664Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.9620074Z     self=,
2025-05-07T20:32:11.9620472Z     T=128,
2025-05-07T20:32:11.9620669Z     D=7168,
2025-05-07T20:32:11.9620874Z     scale_ub=1200.0,
2025-05-07T20:32:11.9621096Z     contiguous=False,
2025-05-07T20:32:11.9621334Z     compiled=False,
2025-05-07T20:32:11.9621545Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.0778038Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.0778398Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.0778668Z E       ^
2025-05-07T20:32:12.0779139Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.0779593Z 
2025-05-07T20:32:12.0780013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.0780520Z 
2025-05-07T20:32:12.0780631Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.0781048Z     self=,
2025-05-07T20:32:12.0781456Z     T=128,
2025-05-07T20:32:12.0781653Z     D=5120,
2025-05-07T20:32:12.0781857Z     scale_ub=None,
2025-05-07T20:32:12.0782080Z     contiguous=False,
2025-05-07T20:32:12.0782310Z     compiled=False,
2025-05-07T20:32:12.0782530Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.0809200Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.0809567Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.0809835Z E       ^
2025-05-07T20:32:12.0810307Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.0810755Z 
2025-05-07T20:32:12.0811175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.0811686Z 
2025-05-07T20:32:12.0811799Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.0812219Z     self=,
2025-05-07T20:32:12.0812630Z     T=128,
2025-05-07T20:32:12.0812828Z     D=5120,
2025-05-07T20:32:12.0813132Z     scale_ub=1200.0,
2025-05-07T20:32:12.0813369Z     contiguous=True,
2025-05-07T20:32:12.0813606Z     compiled=False,
2025-05-07T20:32:12.0813816Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.4573669Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4574019Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4574282Z E       ^
2025-05-07T20:32:12.4574744Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4575184Z 
2025-05-07T20:32:12.4575593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4576100Z 
2025-05-07T20:32:12.4576205Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4576619Z     self=,
2025-05-07T20:32:12.4577014Z     T=1,
2025-05-07T20:32:12.4577195Z     D=7168,
2025-05-07T20:32:12.4577409Z     scale_ub=1200.0,
2025-05-07T20:32:12.4577644Z     contiguous=True,
2025-05-07T20:32:12.4577865Z     compiled=True,
2025-05-07T20:32:12.4578078Z )
[test source and traceback identical to the compiled=True example above: the fn() call at moe/activation_test.py:117 fails via torch._dynamo while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.4614012Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.4614450Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.4614716Z E       ^
2025-05-07T20:32:12.4615183Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.4615627Z 
2025-05-07T20:32:12.4616047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.4616557Z 
2025-05-07T20:32:12.4616738Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.4617157Z     self=,
2025-05-07T20:32:12.4617563Z     T=1,
2025-05-07T20:32:12.4617762Z     D=7168,
2025-05-07T20:32:12.4617957Z     scale_ub=1200.0,
2025-05-07T20:32:12.4618190Z     contiguous=False,
2025-05-07T20:32:12.4618428Z     compiled=True,
2025-05-07T20:32:12.4618636Z )
[test source and traceback identical to the compiled=True example above: the fn() call at moe/activation_test.py:117 fails via torch._dynamo while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.6053483Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.6053849Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.6054107Z E       ^
2025-05-07T20:32:12.6054572Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.6055017Z 
2025-05-07T20:32:12.6055439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.6055944Z 
2025-05-07T20:32:12.6056057Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.6056472Z     self=,
2025-05-07T20:32:12.6056904Z     T=1,
2025-05-07T20:32:12.6057115Z     D=7168,
2025-05-07T20:32:12.6057302Z     scale_ub=None,
2025-05-07T20:32:12.6057524Z     contiguous=False,
2025-05-07T20:32:12.6057759Z     compiled=True,
2025-05-07T20:32:12.6057964Z )
[test source and traceback identical to the first example above: fn() succeeds here and the failure comes from ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row through the autotuner]
2025-05-07T20:32:12.6994232Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.6994589Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:12.6994865Z E       ^
2025-05-07T20:32:12.6995334Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.6995781Z 
2025-05-07T20:32:12.6996189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.6996695Z 
2025-05-07T20:32:12.6996799Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.6997260Z     self=,
2025-05-07T20:32:12.6997654Z     T=1,
2025-05-07T20:32:12.6997849Z     D=5120,
2025-05-07T20:32:12.6998055Z     scale_ub=1200.0,
2025-05-07T20:32:12.6998277Z     contiguous=False,
2025-05-07T20:32:12.6998506Z     compiled=True,
2025-05-07T20:32:12.6998719Z )
[test source and traceback identical to the compiled=True example above: the fn() call at moe/activation_test.py:117 fails via torch._dynamo while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.8552966Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8553315Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.8553581Z E       ^
2025-05-07T20:32:12.8554050Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:12.8554493Z 
2025-05-07T20:32:12.8554904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.8555416Z 
2025-05-07T20:32:12.8555519Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.8555931Z     self=,
2025-05-07T20:32:12.8556336Z     T=1,
2025-05-07T20:32:12.8556516Z     D=5120,
2025-05-07T20:32:12.8556713Z     scale_ub=1200.0,
2025-05-07T20:32:12.8556946Z     contiguous=False,
2025-05-07T20:32:12.8557207Z     compiled=False,
2025-05-07T20:32:12.8557421Z )
[test source and traceback identical to the compiled=False example above: the fn() call at moe/activation_test.py:117 fails while compiling _fbgemm_silu_mul_quant]
2025-05-07T20:32:12.8592539Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:12.8592895Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:12.8593151Z E       ^
2025-05-07T20:32:12.8593622Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Hypothesis then kept drawing from the same example space, and every draw failed identically: the same test source, the same traceback through moe/activation.py:80 (silu_mul_quant) into triton/compiler/compiler.py:100, and the same CompilationError wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The compiled=True draws additionally pass through torch/_dynamo/eval_frame.py:678 before reaching the kernel launch. Only the drawn parameters differ:

Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.7793681Z 2025-05-07T20:32:13.7794095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.7794615Z 2025-05-07T20:32:13.7794722Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7795140Z self=, 2025-05-07T20:32:13.7795538Z T=1, 2025-05-07T20:32:13.7795726Z D=7168, 2025-05-07T20:32:13.7795936Z scale_ub=None, 2025-05-07T20:32:13.7796164Z contiguous=False, 2025-05-07T20:32:13.7796393Z compiled=False, 2025-05-07T20:32:13.7796606Z ) 2025-05-07T20:32:13.7796935Z self = 2025-05-07T20:32:13.7797415Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:13.7797682Z 2025-05-07T20:32:13.7797766Z @given( 2025-05-07T20:32:13.7798001Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.7798316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.7798621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.7798961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.7799297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.7799580Z ) 2025-05-07T20:32:13.7799937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.7800380Z def test_silu_mul_quant( 2025-05-07T20:32:13.7800619Z self, 2025-05-07T20:32:13.7800826Z T: int, 2025-05-07T20:32:13.7801037Z D: int, 2025-05-07T20:32:13.7801341Z scale_ub: Optional[float], 2025-05-07T20:32:13.7801620Z contiguous: bool, 2025-05-07T20:32:13.7801871Z compiled: bool, 2025-05-07T20:32:13.7802096Z ) -> None: 2025-05-07T20:32:13.7802322Z torch.manual_seed(2025) 2025-05-07T20:32:13.7802579Z 2025-05-07T20:32:13.7802850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.7803195Z 2025-05-07T20:32:13.7803396Z x_sign = torch.sign(x) 2025-05-07T20:32:13.7803765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.7804073Z x = x_sign * x_clamp 2025-05-07T20:32:13.7804321Z x0 = x[:, :D] 2025-05-07T20:32:13.7804546Z x1 = x[:, D:] 2025-05-07T20:32:13.7804754Z 2025-05-07T20:32:13.7804943Z if contiguous: 2025-05-07T20:32:13.7805182Z x0 = x0.contiguous() 2025-05-07T20:32:13.7805438Z x1 = x1.contiguous() 2025-05-07T20:32:13.7805686Z 2025-05-07T20:32:13.7805884Z if scale_ub is not None: 2025-05-07T20:32:13.7806165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.7806504Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.7806817Z ) 2025-05-07T20:32:13.7807014Z else: 2025-05-07T20:32:13.7807233Z scale_ub_tensor = None 2025-05-07T20:32:13.7807489Z 2025-05-07T20:32:13.7807719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.7808043Z op = silu_mul_quant 2025-05-07T20:32:13.7808297Z if compiled: 2025-05-07T20:32:13.7808551Z op = torch.compile(op) 2025-05-07T20:32:13.7808844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.7809125Z 2025-05-07T20:32:13.7809327Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.7809491Z 2025-05-07T20:32:13.7809591Z moe/activation_test.py:117: 2025-05-07T20:32:13.7809892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.7810228Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.7810512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.7811199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.7811881Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.7812423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.7813160Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.7813821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.7814354Z kernel = self.compile( 2025-05-07T20:32:13.7814892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.7815547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.7815959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.7816191Z 2025-05-07T20:32:13.7816414Z self = 2025-05-07T20:32:13.7817479Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.7818846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a703bbc40>} 2025-05-07T20:32:13.7820179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.7821274Z context = 2025-05-07T20:32:13.7821561Z 2025-05-07T20:32:13.7821735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.7822248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.7822716Z module_map=module_map) 2025-05-07T20:32:13.7823082Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.7823508Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.7823771Z E ^ 2025-05-07T20:32:13.7824235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.7824677Z 2025-05-07T20:32:13.7825093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.7825596Z 2025-05-07T20:32:13.7825701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.7826124Z self=, 2025-05-07T20:32:13.7826521Z T=2048, 2025-05-07T20:32:13.7826710Z D=7168, 2025-05-07T20:32:13.7826906Z scale_ub=None, 2025-05-07T20:32:13.7827128Z contiguous=False, 2025-05-07T20:32:13.7827349Z compiled=True, 2025-05-07T20:32:13.7827559Z ) 2025-05-07T20:32:13.8694904Z self = 2025-05-07T20:32:13.8695420Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.8695710Z 2025-05-07T20:32:13.8695787Z @given( 2025-05-07T20:32:13.8696021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8696341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8696640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8696978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8697339Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8697654Z ) 2025-05-07T20:32:13.8698006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8698451Z def test_silu_mul_quant( 2025-05-07T20:32:13.8698702Z self, 2025-05-07T20:32:13.8698895Z T: int, 2025-05-07T20:32:13.8699101Z D: int, 2025-05-07T20:32:13.8699327Z scale_ub: Optional[float], 2025-05-07T20:32:13.8699598Z contiguous: bool, 2025-05-07T20:32:13.8699841Z compiled: bool, 2025-05-07T20:32:13.8700076Z ) -> None: 2025-05-07T20:32:13.8700290Z torch.manual_seed(2025) 2025-05-07T20:32:13.8700538Z 2025-05-07T20:32:13.8700814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8701154Z 2025-05-07T20:32:13.8701392Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8701684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8702007Z x = x_sign * x_clamp 2025-05-07T20:32:13.8702254Z x0 = x[:, :D] 2025-05-07T20:32:13.8702473Z x1 = x[:, D:] 2025-05-07T20:32:13.8702686Z 2025-05-07T20:32:13.8702878Z if contiguous: 2025-05-07T20:32:13.8703106Z x0 = x0.contiguous() 2025-05-07T20:32:13.8703378Z x1 = x1.contiguous() 2025-05-07T20:32:13.8703622Z 2025-05-07T20:32:13.8703821Z if scale_ub is not None: 2025-05-07T20:32:13.8704097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8704430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8704749Z ) 2025-05-07T20:32:13.8704943Z else: 2025-05-07T20:32:13.8705157Z scale_ub_tensor = None 2025-05-07T20:32:13.8705412Z 2025-05-07T20:32:13.8705647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8705963Z op = silu_mul_quant 2025-05-07T20:32:13.8706219Z if compiled: 2025-05-07T20:32:13.8706464Z op = torch.compile(op) 2025-05-07T20:32:13.8706769Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8707453Z 2025-05-07T20:32:13.8707649Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8707820Z 2025-05-07T20:32:13.8707918Z moe/activation_test.py:117: 2025-05-07T20:32:13.8708226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8708559Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8708838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8709537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.8710098Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.8710747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8711428Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8711966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8712646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8713302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8713834Z kernel = self.compile( 2025-05-07T20:32:13.8714374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8715024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8715427Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8715662Z 2025-05-07T20:32:13.8715869Z self = 2025-05-07T20:32:13.8716959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8718321Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a9c629800>} 2025-05-07T20:32:13.8719638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8720652Z context = 2025-05-07T20:32:13.8720940Z 2025-05-07T20:32:13.8721113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8721632Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8722093Z module_map=module_map) 2025-05-07T20:32:13.8722461Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8722828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8731566Z E ^ 2025-05-07T20:32:13.8732077Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8732537Z 2025-05-07T20:32:13.8732959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8733568Z 2025-05-07T20:32:13.8733694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.8734110Z self=, 2025-05-07T20:32:13.8734514Z T=4096, 2025-05-07T20:32:13.8734710Z D=7168, 2025-05-07T20:32:13.8734897Z scale_ub=None, 2025-05-07T20:32:13.8735117Z contiguous=False, 2025-05-07T20:32:13.8735341Z compiled=True, 2025-05-07T20:32:13.8735555Z ) 2025-05-07T20:32:13.8735872Z self = 2025-05-07T20:32:13.8736484Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:13.8736755Z 2025-05-07T20:32:13.8736846Z @given( 2025-05-07T20:32:13.8737082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.8737453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.8737777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.8738104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.8738516Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.8738805Z ) 2025-05-07T20:32:13.8739166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.8739605Z def test_silu_mul_quant( 2025-05-07T20:32:13.8739852Z self, 2025-05-07T20:32:13.8740060Z T: int, 2025-05-07T20:32:13.8740260Z D: int, 2025-05-07T20:32:13.8740490Z scale_ub: Optional[float], 2025-05-07T20:32:13.8740773Z contiguous: bool, 2025-05-07T20:32:13.8741014Z compiled: bool, 2025-05-07T20:32:13.8741257Z ) -> None: 2025-05-07T20:32:13.8741488Z torch.manual_seed(2025) 2025-05-07T20:32:13.8741724Z 2025-05-07T20:32:13.8742001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.8742351Z 2025-05-07T20:32:13.8742546Z x_sign = torch.sign(x) 2025-05-07T20:32:13.8742846Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.8743166Z x = x_sign * x_clamp 2025-05-07T20:32:13.8743418Z x0 = x[:, :D] 2025-05-07T20:32:13.8743651Z x1 = x[:, D:] 2025-05-07T20:32:13.8743868Z 2025-05-07T20:32:13.8744059Z if contiguous: 2025-05-07T20:32:13.8744298Z x0 = x0.contiguous() 2025-05-07T20:32:13.8744571Z x1 = x1.contiguous() 2025-05-07T20:32:13.8744821Z 2025-05-07T20:32:13.8745014Z if scale_ub is not None: 2025-05-07T20:32:13.8745303Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.8745655Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.8745961Z ) 2025-05-07T20:32:13.8746163Z else: 2025-05-07T20:32:13.8746384Z scale_ub_tensor = None 2025-05-07T20:32:13.8746637Z 2025-05-07T20:32:13.8746884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.8747237Z op = silu_mul_quant 2025-05-07T20:32:13.8747514Z if compiled: 2025-05-07T20:32:13.8747776Z op = torch.compile(op) 2025-05-07T20:32:13.8748088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8748363Z 2025-05-07T20:32:13.8748566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:13.8748731Z 2025-05-07T20:32:13.8748845Z moe/activation_test.py:117: 2025-05-07T20:32:13.8749149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8749481Z moe/activation_test.py:115: in fn 2025-05-07T20:32:13.8749770Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.8750336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:13.8750895Z return fn(*args, **kwargs) 
2025-05-07T20:32:13.8751560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:13.8752261Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:13.8752825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.8753505Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.8754171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.8754709Z kernel = self.compile( 2025-05-07T20:32:13.8755250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.8755997Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.8756411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.8756642Z 2025-05-07T20:32:13.8756865Z self = 2025-05-07T20:32:13.8757986Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.8759673Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43953880>} 2025-05-07T20:32:13.8761024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.8762074Z context = 2025-05-07T20:32:13.8762370Z 2025-05-07T20:32:13.8762562Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.8763098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.8763577Z module_map=module_map) 2025-05-07T20:32:13.8763969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.8764331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.8764605Z E ^ 2025-05-07T20:32:13.8765092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.8765549Z 2025-05-07T20:32:13.8765971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.8766476Z 2025-05-07T20:32:14.0344886Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.0345665Z self=, 2025-05-07T20:32:14.0346250Z T=16384, 2025-05-07T20:32:14.0346541Z D=5120, 2025-05-07T20:32:14.0346831Z scale_ub=1200.0, 2025-05-07T20:32:14.0347073Z contiguous=False, 2025-05-07T20:32:14.0347308Z compiled=False, 2025-05-07T20:32:14.0347541Z ) 2025-05-07T20:32:14.0347877Z self = 2025-05-07T20:32:14.0348404Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:14.0348704Z 2025-05-07T20:32:14.0348785Z @given( 2025-05-07T20:32:14.0349035Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.0349360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.0349672Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.0350011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.0350354Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.0350645Z ) 2025-05-07T20:32:14.0351008Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.0351456Z def test_silu_mul_quant( 2025-05-07T20:32:14.0351706Z self, 2025-05-07T20:32:14.0351918Z T: int, 2025-05-07T20:32:14.0352129Z D: int, 2025-05-07T20:32:14.0352351Z scale_ub: Optional[float], 2025-05-07T20:32:14.0352644Z contiguous: bool, 2025-05-07T20:32:14.0352898Z compiled: bool, 2025-05-07T20:32:14.0353144Z ) -> None: 2025-05-07T20:32:14.0353363Z torch.manual_seed(2025) 2025-05-07T20:32:14.0353615Z 2025-05-07T20:32:14.0353902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.0354249Z 2025-05-07T20:32:14.0354457Z x_sign = torch.sign(x) 2025-05-07T20:32:14.0354755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.0355067Z x = x_sign * x_clamp 2025-05-07T20:32:14.0355571Z x0 = x[:, :D] 2025-05-07T20:32:14.0355803Z x1 = x[:, D:] 2025-05-07T20:32:14.0356010Z 2025-05-07T20:32:14.0356207Z if contiguous: 2025-05-07T20:32:14.0356448Z x0 = x0.contiguous() 2025-05-07T20:32:14.0356710Z x1 = x1.contiguous() 2025-05-07T20:32:14.0356964Z 2025-05-07T20:32:14.0357168Z if scale_ub is not None: 2025-05-07T20:32:14.0357447Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.0357937Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.0358265Z ) 2025-05-07T20:32:14.0358469Z else: 2025-05-07T20:32:14.0358679Z scale_ub_tensor = None 2025-05-07T20:32:14.0358939Z 2025-05-07T20:32:14.0359411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.0359769Z op = silu_mul_quant 2025-05-07T20:32:14.0360034Z if compiled: 2025-05-07T20:32:14.0360290Z op = torch.compile(op) 2025-05-07T20:32:14.0360596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0360884Z 2025-05-07T20:32:14.0361087Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.0361255Z 2025-05-07T20:32:14.0361357Z moe/activation_test.py:117: 2025-05-07T20:32:14.0361665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0362009Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.0362301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0362998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:14.0363684Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.0364223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.0364895Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.0365565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.0366097Z kernel = self.compile( 2025-05-07T20:32:14.0366640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.0367290Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.0367680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0367915Z 2025-05-07T20:32:14.0368126Z self = 2025-05-07T20:32:14.0369197Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.0370581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43952700>} 2025-05-07T20:32:14.0371897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.0372898Z context = 2025-05-07T20:32:14.0373262Z 2025-05-07T20:32:14.0373428Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.0373943Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.0374408Z module_map=module_map) 2025-05-07T20:32:14.0374766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.0375120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.0375384Z E ^ 2025-05-07T20:32:14.0375979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.0376431Z 2025-05-07T20:32:14.0376842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.0377347Z 2025-05-07T20:32:14.0377452Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.0377862Z self=, 2025-05-07T20:32:14.0378369Z T=16384, 2025-05-07T20:32:14.0378568Z D=5120, 2025-05-07T20:32:14.0378765Z scale_ub=1200.0, 2025-05-07T20:32:14.0378987Z contiguous=True, 2025-05-07T20:32:14.0379210Z compiled=True, 2025-05-07T20:32:14.0379418Z ) 2025-05-07T20:32:14.0379731Z self = 2025-05-07T20:32:14.0380228Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.0380506Z 2025-05-07T20:32:14.0380586Z @given( 2025-05-07T20:32:14.0380824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.0381133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.0381439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.0381784Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.0382107Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.0382401Z ) 2025-05-07T20:32:14.0382755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.0383198Z def test_silu_mul_quant( 2025-05-07T20:32:14.0383493Z self, 2025-05-07T20:32:14.0383785Z T: int, 2025-05-07T20:32:14.0384099Z D: int, 2025-05-07T20:32:14.0384425Z scale_ub: Optional[float], 2025-05-07T20:32:14.0384816Z contiguous: bool, 2025-05-07T20:32:14.0385171Z compiled: bool, 2025-05-07T20:32:14.0385489Z ) -> None: 2025-05-07T20:32:14.0385790Z torch.manual_seed(2025) 2025-05-07T20:32:14.0386168Z 2025-05-07T20:32:14.0386596Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.0387162Z 2025-05-07T20:32:14.0387468Z x_sign = torch.sign(x) 2025-05-07T20:32:14.0387925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.0388433Z x = x_sign * x_clamp 2025-05-07T20:32:14.0388824Z x0 = x[:, :D] 2025-05-07T20:32:14.0389165Z x1 = x[:, D:] 2025-05-07T20:32:14.0389520Z 2025-05-07T20:32:14.0389824Z if contiguous: 2025-05-07T20:32:14.0390200Z x0 = x0.contiguous() 2025-05-07T20:32:14.0390617Z x1 = x1.contiguous() 2025-05-07T20:32:14.0391073Z 2025-05-07T20:32:14.0391365Z if scale_ub is not None: 2025-05-07T20:32:14.0391769Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.0392235Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.0392607Z ) 2025-05-07T20:32:14.0392811Z else: 2025-05-07T20:32:14.0393035Z scale_ub_tensor = None 2025-05-07T20:32:14.0393301Z 2025-05-07T20:32:14.0393546Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.0393856Z op = silu_mul_quant 2025-05-07T20:32:14.0394108Z if compiled: 2025-05-07T20:32:14.0394363Z op = torch.compile(op) 2025-05-07T20:32:14.0394657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0394939Z 2025-05-07T20:32:14.0395135Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.0395304Z 2025-05-07T20:32:14.0395412Z moe/activation_test.py:117: 2025-05-07T20:32:14.0395706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0396043Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.0396330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.0396893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.0397583Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.0398244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.0398929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.0399454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.0400127Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.0400866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.0401390Z kernel = self.compile( 2025-05-07T20:32:14.0401936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.0402594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.0403005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.0403231Z 2025-05-07T20:32:14.0403440Z self = 2025-05-07T20:32:14.0404513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.0405885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58561e40>} 2025-05-07T20:32:14.0407212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.0408220Z context = 2025-05-07T20:32:14.0408515Z 2025-05-07T20:32:14.0408689Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.0409214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.0409687Z module_map=module_map) 2025-05-07T20:32:14.0410057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.0410422Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.0410695Z E ^ 2025-05-07T20:32:14.0411162Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.0411618Z 2025-05-07T20:32:14.0412029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.0412538Z 2025-05-07T20:32:14.2115133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2115792Z self=, 2025-05-07T20:32:14.2116417Z T=16384, 2025-05-07T20:32:14.2116686Z D=5120, 2025-05-07T20:32:14.2116962Z scale_ub=None, 2025-05-07T20:32:14.2117412Z contiguous=False, 2025-05-07T20:32:14.2117876Z compiled=True, 2025-05-07T20:32:14.2118292Z ) 2025-05-07T20:32:14.2118936Z self = 2025-05-07T20:32:14.2119916Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.2120485Z 2025-05-07T20:32:14.2120647Z @given( 2025-05-07T20:32:14.2121118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.2121733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.2122349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.2123003Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.2123648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.2124210Z ) 2025-05-07T20:32:14.2125212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.2126096Z def test_silu_mul_quant( 2025-05-07T20:32:14.2126570Z self, 2025-05-07T20:32:14.2126960Z T: int, 2025-05-07T20:32:14.2127341Z D: int, 2025-05-07T20:32:14.2127581Z scale_ub: Optional[float], 2025-05-07T20:32:14.2127846Z contiguous: bool, 2025-05-07T20:32:14.2128085Z compiled: bool, 2025-05-07T20:32:14.2128316Z ) -> None: 2025-05-07T20:32:14.2128675Z torch.manual_seed(2025) 2025-05-07T20:32:14.2128920Z 2025-05-07T20:32:14.2129198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.2129533Z 2025-05-07T20:32:14.2129729Z x_sign = torch.sign(x) 2025-05-07T20:32:14.2130023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.2130327Z x = x_sign * x_clamp 2025-05-07T20:32:14.2130574Z x0 = x[:, :D] 2025-05-07T20:32:14.2130793Z x1 = x[:, D:] 2025-05-07T20:32:14.2131003Z 2025-05-07T20:32:14.2131195Z if contiguous: 2025-05-07T20:32:14.2131431Z x0 = x0.contiguous() 2025-05-07T20:32:14.2131691Z x1 = x1.contiguous() 2025-05-07T20:32:14.2131924Z 2025-05-07T20:32:14.2132120Z if scale_ub is not None: 2025-05-07T20:32:14.2132395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.2132720Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.2133103Z ) 2025-05-07T20:32:14.2133300Z else: 2025-05-07T20:32:14.2133507Z scale_ub_tensor = None 2025-05-07T20:32:14.2133762Z 2025-05-07T20:32:14.2133998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.2134302Z op = silu_mul_quant 2025-05-07T20:32:14.2134553Z if compiled: 2025-05-07T20:32:14.2134802Z op = torch.compile(op) 2025-05-07T20:32:14.2135090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2135362Z 2025-05-07T20:32:14.2135561Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.2135722Z 2025-05-07T20:32:14.2135826Z moe/activation_test.py:117: 2025-05-07T20:32:14.2136116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2136446Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.2136723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.2137274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.2137837Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.2138485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.2139164Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.2139686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.2140362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.2141017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.2141537Z kernel = self.compile( 2025-05-07T20:32:14.2142075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.2142717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2143116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.2143340Z 2025-05-07T20:32:14.2143544Z self = 2025-05-07T20:32:14.2144607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.2146037Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59321e40>} 2025-05-07T20:32:14.2147362Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.2148383Z context = 2025-05-07T20:32:14.2148745Z 2025-05-07T20:32:14.2148910Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.2149426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2149892Z module_map=module_map) 2025-05-07T20:32:14.2150248Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.2150607Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.2150879Z E ^ 2025-05-07T20:32:14.2151340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2151788Z 2025-05-07T20:32:14.2152199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.2152705Z 2025-05-07T20:32:14.2152811Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.2153227Z self=, 2025-05-07T20:32:14.2153618Z T=2048, 2025-05-07T20:32:14.2153809Z D=5120, 2025-05-07T20:32:14.2154005Z scale_ub=None, 2025-05-07T20:32:14.2154218Z contiguous=False, 2025-05-07T20:32:14.2154446Z compiled=True, 2025-05-07T20:32:14.2154655Z ) 2025-05-07T20:32:14.5072339Z self = 2025-05-07T20:32:14.5073097Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:14.5073477Z 2025-05-07T20:32:14.5073608Z @given( 2025-05-07T20:32:14.5073938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5074384Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5074709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5075047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5075390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5075696Z ) 2025-05-07T20:32:14.5076046Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5076493Z def test_silu_mul_quant( 2025-05-07T20:32:14.5076747Z self, 2025-05-07T20:32:14.5076946Z T: int, 2025-05-07T20:32:14.5077163Z D: int, 2025-05-07T20:32:14.5077397Z scale_ub: Optional[float], 2025-05-07T20:32:14.5077684Z contiguous: bool, 2025-05-07T20:32:14.5077937Z compiled: bool, 2025-05-07T20:32:14.5078177Z ) -> None: 2025-05-07T20:32:14.5078415Z torch.manual_seed(2025) 2025-05-07T20:32:14.5078669Z 2025-05-07T20:32:14.5078952Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5079300Z 2025-05-07T20:32:14.5087459Z x_sign = torch.sign(x) 2025-05-07T20:32:14.5087783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.5088105Z x = x_sign * x_clamp 2025-05-07T20:32:14.5088360Z x0 = x[:, :D] 2025-05-07T20:32:14.5088590Z x1 = x[:, D:] 2025-05-07T20:32:14.5088801Z 2025-05-07T20:32:14.5089000Z if contiguous: 2025-05-07T20:32:14.5089241Z x0 = x0.contiguous() 2025-05-07T20:32:14.5089508Z x1 = x1.contiguous() 2025-05-07T20:32:14.5089743Z 2025-05-07T20:32:14.5089941Z if scale_ub is not None: 2025-05-07T20:32:14.5090217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.5090553Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.5090867Z ) 2025-05-07T20:32:14.5091234Z else: 2025-05-07T20:32:14.5091446Z scale_ub_tensor = None 2025-05-07T20:32:14.5091705Z 2025-05-07T20:32:14.5091943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.5092256Z op = silu_mul_quant 2025-05-07T20:32:14.5092519Z if compiled: 2025-05-07T20:32:14.5092770Z op = torch.compile(op) 2025-05-07T20:32:14.5093120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5093523Z 2025-05-07T20:32:14.5093727Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.5093894Z 2025-05-07T20:32:14.5094002Z moe/activation_test.py:117: 2025-05-07T20:32:14.5094295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5094625Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.5094907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5095468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.5096029Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.5096683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.5097362Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.5097899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.5098580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5099240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.5099763Z kernel = self.compile( 2025-05-07T20:32:14.5100303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.5100955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.5101359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5101587Z 2025-05-07T20:32:14.5101794Z self = 2025-05-07T20:32:14.5102862Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.5104226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59320d60>} 2025-05-07T20:32:14.5105548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.5106567Z context = 2025-05-07T20:32:14.5106862Z 2025-05-07T20:32:14.5107030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.5107565Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.5108033Z module_map=module_map) 2025-05-07T20:32:14.5108396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.5108761Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.5109027Z E ^ 2025-05-07T20:32:14.5109487Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.5109941Z 2025-05-07T20:32:14.5110354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.5110871Z 2025-05-07T20:32:14.5110976Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.5111492Z self=, 2025-05-07T20:32:14.5111886Z T=2048, 2025-05-07T20:32:14.5112086Z D=5120, 2025-05-07T20:32:14.5112289Z scale_ub=1200.0, 2025-05-07T20:32:14.5112511Z contiguous=False, 2025-05-07T20:32:14.5112747Z compiled=True, 2025-05-07T20:32:14.5112958Z ) 2025-05-07T20:32:14.5113272Z self = 2025-05-07T20:32:14.5113851Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.5114130Z 2025-05-07T20:32:14.5114210Z @given( 2025-05-07T20:32:14.5114449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.5114754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.5115065Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.5115401Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.5115752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.5116033Z ) 2025-05-07T20:32:14.5116397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.5116837Z def test_silu_mul_quant( 2025-05-07T20:32:14.5117077Z self, 2025-05-07T20:32:14.5117275Z T: int, 2025-05-07T20:32:14.5117489Z D: int, 2025-05-07T20:32:14.5117731Z scale_ub: Optional[float], 2025-05-07T20:32:14.5118047Z contiguous: bool, 2025-05-07T20:32:14.5118300Z compiled: bool, 2025-05-07T20:32:14.5118529Z ) -> None: 2025-05-07T20:32:14.5118744Z torch.manual_seed(2025) 2025-05-07T20:32:14.5118998Z 2025-05-07T20:32:14.5119279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.5119621Z 2025-05-07T20:32:14.5119823Z x_sign = torch.sign(x) 2025-05-07T20:32:14.5120117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.5120422Z x = x_sign * x_clamp 2025-05-07T20:32:14.5120666Z x0 = x[:, :D] 2025-05-07T20:32:14.5120895Z x1 = x[:, D:] 2025-05-07T20:32:14.5121099Z 2025-05-07T20:32:14.5121291Z if contiguous: 2025-05-07T20:32:14.5121523Z x0 = x0.contiguous() 2025-05-07T20:32:14.5121772Z x1 = x1.contiguous() 2025-05-07T20:32:14.5122015Z 2025-05-07T20:32:14.5122210Z if scale_ub is not None: 2025-05-07T20:32:14.5122474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.5122816Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.5123130Z ) 2025-05-07T20:32:14.5123327Z else: 2025-05-07T20:32:14.5123532Z scale_ub_tensor = None 2025-05-07T20:32:14.5123789Z 2025-05-07T20:32:14.5124021Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.5124334Z op = silu_mul_quant 2025-05-07T20:32:14.5124588Z if compiled: 2025-05-07T20:32:14.5124841Z op = torch.compile(op) 2025-05-07T20:32:14.5125130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5125415Z 2025-05-07T20:32:14.5125608Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.5125771Z 2025-05-07T20:32:14.5125870Z moe/activation_test.py:117: 2025-05-07T20:32:14.5126171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5126500Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.5126775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.5127332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.5127933Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.5128586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.5129260Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.5129794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.5130555Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.5131210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.5131723Z kernel = self.compile( 2025-05-07T20:32:14.5132262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.5132985Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.5133447Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.5133678Z 2025-05-07T20:32:14.5133883Z self = 2025-05-07T20:32:14.5134949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.5136301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59031a80>} 2025-05-07T20:32:14.5137679Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.5138693Z context = 2025-05-07T20:32:14.5138982Z 2025-05-07T20:32:14.5139151Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.5139665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.5140136Z module_map=module_map) 2025-05-07T20:32:14.5140498Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.5140861Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.5141121Z E ^ 2025-05-07T20:32:14.5141573Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.5142025Z 2025-05-07T20:32:14.5142439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.5142951Z 2025-05-07T20:32:14.6863678Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6864958Z self=, 2025-05-07T20:32:14.6865962Z T=4096, 2025-05-07T20:32:14.6866358Z D=5120, 2025-05-07T20:32:14.6866743Z scale_ub=1200.0, 2025-05-07T20:32:14.6867196Z contiguous=True, 2025-05-07T20:32:14.6867574Z compiled=True, 2025-05-07T20:32:14.6867787Z ) 2025-05-07T20:32:14.6868108Z self = 2025-05-07T20:32:14.6868620Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:14.6868891Z 2025-05-07T20:32:14.6868981Z @given( 2025-05-07T20:32:14.6869216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6869535Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6869934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6870386Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6870752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6871042Z ) 2025-05-07T20:32:14.6871398Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6871835Z def test_silu_mul_quant( 2025-05-07T20:32:14.6872083Z self, 2025-05-07T20:32:14.6872275Z T: int, 2025-05-07T20:32:14.6872472Z D: int, 2025-05-07T20:32:14.6872695Z scale_ub: Optional[float], 2025-05-07T20:32:14.6872969Z contiguous: bool, 2025-05-07T20:32:14.6873211Z compiled: bool, 2025-05-07T20:32:14.6873634Z ) -> None: 2025-05-07T20:32:14.6873860Z torch.manual_seed(2025) 2025-05-07T20:32:14.6874106Z 2025-05-07T20:32:14.6874384Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6874728Z 2025-05-07T20:32:14.6874923Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6875217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6875646Z x = x_sign * x_clamp 2025-05-07T20:32:14.6875896Z x0 = x[:, :D] 2025-05-07T20:32:14.6876107Z x1 = x[:, D:] 2025-05-07T20:32:14.6876314Z 2025-05-07T20:32:14.6876500Z if contiguous: 2025-05-07T20:32:14.6876738Z x0 = x0.contiguous() 2025-05-07T20:32:14.6876996Z x1 = x1.contiguous() 2025-05-07T20:32:14.6877229Z 2025-05-07T20:32:14.6877417Z if scale_ub is not None: 2025-05-07T20:32:14.6877685Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6878020Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6878323Z ) 2025-05-07T20:32:14.6878514Z else: 2025-05-07T20:32:14.6878725Z scale_ub_tensor = None 2025-05-07T20:32:14.6878977Z 2025-05-07T20:32:14.6879213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6879536Z op = silu_mul_quant 2025-05-07T20:32:14.6879782Z if compiled: 2025-05-07T20:32:14.6880042Z op = torch.compile(op) 2025-05-07T20:32:14.6880350Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6880621Z 2025-05-07T20:32:14.6880829Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.6880995Z 2025-05-07T20:32:14.6881101Z moe/activation_test.py:117: 2025-05-07T20:32:14.6881402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6881736Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.6882021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6882586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.6883146Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.6883802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.6884481Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.6885011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6885690Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6886348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6886877Z kernel = self.compile( 2025-05-07T20:32:14.6887415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6888069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6888463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6888689Z 2025-05-07T20:32:14.6888901Z self = 2025-05-07T20:32:14.6889966Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6891338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59033420>} 2025-05-07T20:32:14.6892660Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6893869Z context = 2025-05-07T20:32:14.6894157Z 2025-05-07T20:32:14.6894333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6894848Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6895317Z module_map=module_map) 2025-05-07T20:32:14.6895756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6896107Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.6896367Z E ^ 2025-05-07T20:32:14.6896831Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6897272Z 2025-05-07T20:32:14.6897694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6898204Z 2025-05-07T20:32:14.6898320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6898733Z self=, 2025-05-07T20:32:14.6899143Z T=128, 2025-05-07T20:32:14.6899333Z D=5120, 2025-05-07T20:32:14.6899532Z scale_ub=1200.0, 2025-05-07T20:32:14.6899762Z contiguous=False, 2025-05-07T20:32:14.6899988Z compiled=True, 2025-05-07T20:32:14.6900196Z ) 2025-05-07T20:32:14.7907621Z self = 2025-05-07T20:32:14.7908846Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:14.7909531Z 2025-05-07T20:32:14.7909755Z @given( 2025-05-07T20:32:14.7910324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.7911013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.7911568Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.7912162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.7912770Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.7913294Z ) 2025-05-07T20:32:14.7913922Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.7914725Z def test_silu_mul_quant( 2025-05-07T20:32:14.7915167Z self, 2025-05-07T20:32:14.7915519Z T: int, 2025-05-07T20:32:14.7915883Z D: int, 2025-05-07T20:32:14.7916284Z scale_ub: Optional[float], 2025-05-07T20:32:14.7916783Z contiguous: bool, 2025-05-07T20:32:14.7917214Z compiled: bool, 2025-05-07T20:32:14.7917625Z ) -> None: 2025-05-07T20:32:14.7918009Z torch.manual_seed(2025) 2025-05-07T20:32:14.7918292Z 2025-05-07T20:32:14.7918571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.7918915Z 2025-05-07T20:32:14.7919112Z x_sign = torch.sign(x) 2025-05-07T20:32:14.7919411Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.7919725Z x = x_sign * x_clamp 2025-05-07T20:32:14.7919969Z x0 = x[:, :D] 2025-05-07T20:32:14.7920190Z x1 = x[:, D:] 2025-05-07T20:32:14.7920402Z 2025-05-07T20:32:14.7920584Z if contiguous: 2025-05-07T20:32:14.7920821Z x0 = x0.contiguous() 2025-05-07T20:32:14.7921080Z x1 = x1.contiguous() 2025-05-07T20:32:14.7921314Z 2025-05-07T20:32:14.7921513Z if scale_ub is not None: 2025-05-07T20:32:14.7921790Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.7922125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.7922432Z ) 2025-05-07T20:32:14.7922633Z else: 2025-05-07T20:32:14.7922852Z scale_ub_tensor = None 2025-05-07T20:32:14.7923101Z 2025-05-07T20:32:14.7923340Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.7923661Z op = silu_mul_quant 2025-05-07T20:32:14.7923918Z if compiled: 2025-05-07T20:32:14.7924173Z op = torch.compile(op) 2025-05-07T20:32:14.7924689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7924966Z 2025-05-07T20:32:14.7925167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:14.7925332Z 2025-05-07T20:32:14.7925448Z moe/activation_test.py:117: 2025-05-07T20:32:14.7925742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7926082Z moe/activation_test.py:115: in fn 2025-05-07T20:32:14.7926480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.7927040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:14.7927596Z return fn(*args, **kwargs) 
2025-05-07T20:32:14.7928260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:14.7928942Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:14.7929488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.7930162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.7930817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.7931348Z kernel = self.compile( 2025-05-07T20:32:14.7931885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.7932539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.7932941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.7933250Z 2025-05-07T20:32:14.7933463Z self = 2025-05-07T20:32:14.7934532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.7935885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59032c00>} 2025-05-07T20:32:14.7937208Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.7938227Z context = 2025-05-07T20:32:14.7938512Z 2025-05-07T20:32:14.7938678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.7939196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.7939662Z module_map=module_map) 2025-05-07T20:32:14.7940036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.7940389Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.7940662Z E ^ 2025-05-07T20:32:14.7941129Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:14.7942608Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:14.9194849Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:14.9227156Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:15.0998503Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:15.1031401Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:15.1973194Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
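For reference, the failure is reproducible without FBGEMM. This is a hypothetical repro sketch (kernel name and shapes invented for illustration, untested here); on a pre-8.9 GPU the cast to tl.float8e4nv is what the backend rejects at compile time with the same ValueError:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 this cast fails to compile:
        # "type fp8e4nv not supported in this architecture."
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)
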
2025-05-07T20:32:15.2632626Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:15.2633241Z     self=,
2025-05-07T20:32:15.2633864Z     T=16384,
2025-05-07T20:32:15.2634119Z     D=5120,
2025-05-07T20:32:15.2634402Z     scale_ub=None,
2025-05-07T20:32:15.2634704Z     contiguous=False,
2025-05-07T20:32:15.2634970Z     compiled=False,
2025-05-07T20:32:15.2635176Z )
2025-05-07T20:32:15.2635490Z self = 
2025-05-07T20:32:15.2635988Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

2025-05-07T20:32:15.2641406Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

2025-05-07T20:32:15.2641941Z         x_sign = torch.sign(x)
2025-05-07T20:32:15.2642230Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:15.2644226Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

2025-05-07T20:32:15.2646301Z moe/activation_test.py:95: OutOfMemoryError

2025-05-07T20:32:15.2646635Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError (tried to allocate 112.00 MiB) at moe/activation_test.py:95
2025-05-07T20:32:15.2659993Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError (tried to allocate 448.00 MiB) at moe/activation_test.py:92
2025-05-07T20:32:15.2672502Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError (tried to allocate 56.00 MiB) at moe/activation_test.py:95
2025-05-07T20:32:15.2685627Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError (tried to allocate 56.00 MiB) at moe/activation_test.py:94
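The OutOfMemoryError entries above look like a knock-on effect of the compile failures rather than an independent bug: each Hypothesis example materializes a [T, 2*D] bfloat16 input plus derived tensors (for T=16384, D=7168 a single tensor is 16384 * 14336 * 2 bytes = 448 MiB), and with roughly 21.6 GiB of the GPU's 22.07 GiB already allocated, even 56 MiB requests fail. A sketch of one mitigation, under the assumption that the harness can run cleanup between examples (the helper is illustrative, not from the repo); separately, the error text's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, would need to be exported before CUDA is initialized, e.g. in the workflow step that launches pytest:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references first, then return the allocator's
        # cached blocks so the next example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()
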
2025-05-07T20:32:15.3828420Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.3859652Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:15.4571008Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.4600205Z 2025-05-07T20:32:15.4600615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.4601115Z 2025-05-07T20:32:15.4601223Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.4601626Z self=, 2025-05-07T20:32:15.4602011Z T=2048, 2025-05-07T20:32:15.4602195Z D=7168, 2025-05-07T20:32:15.4602384Z scale_ub=1200.0, 2025-05-07T20:32:15.4602596Z contiguous=True, 2025-05-07T20:32:15.4602821Z compiled=False, 2025-05-07T20:32:15.4603023Z ) 2025-05-07T20:32:15.5405878Z self = 2025-05-07T20:32:15.5406601Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:15.5406989Z 2025-05-07T20:32:15.5407116Z @given( 2025-05-07T20:32:15.5407432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.5407868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.5408315Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.5408659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.5408998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.5409316Z ) 2025-05-07T20:32:15.5409675Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.5410121Z def test_silu_mul_quant( 2025-05-07T20:32:15.5410369Z self, 2025-05-07T20:32:15.5410581Z T: int, 2025-05-07T20:32:15.5410965Z D: int, 2025-05-07T20:32:15.5411183Z scale_ub: Optional[float], 2025-05-07T20:32:15.5411461Z contiguous: bool, 2025-05-07T20:32:15.5411701Z compiled: bool, 2025-05-07T20:32:15.5411933Z ) -> None: 2025-05-07T20:32:15.5412150Z torch.manual_seed(2025) 2025-05-07T20:32:15.5412389Z 2025-05-07T20:32:15.5412671Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.5414910Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
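Every CompilationError in this run is the same hardware mismatch: Triton's fp8e4nv type (torch.float8_e4m3fn) needs an sm_89-or-newer NVIDIA GPU, and the A10G behind a linux.g5.4xlarge runner is sm_86, which is why Triton offers only 'fp8e4b15' and 'fp8e5' here. A minimal capability guard, sketched as a hypothetical addition to the test module (fp8e4nv_supported is not a name from this log):

import unittest

import torch

def fp8e4nv_supported() -> bool:
    # fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9
    # (Ada/Hopper); the A10G on this runner reports sm_86.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the test method:
# @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv unsupported on this GPU")

Guarding up front would collapse these per-example compile failures into a single clean skip on sm_86 runners.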
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError
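The allocator hint in the full error message above (PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) only helps against fragmentation, and it must be in the environment before the process first initializes CUDA. A sketch of how it would be applied, assuming the test process can be configured at interpreter startup:

import os

# Read by the CUDA caching allocator when it is first initialized,
# so it must be set before any CUDA tensor exists in this process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the env var is in place

x = torch.randn(128, 2 * 7168, device="cuda", dtype=torch.bfloat16)

In this log, though, only 13.87 MiB is reserved-but-unallocated against 21.73 GiB of live allocations, so the failures point at memory accumulating across examples rather than fragmentation.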
Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free (21.74 GiB allocated by PyTorch, 5.24 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free (21.77 GiB allocated by PyTorch, 4.62 MiB reserved but unallocated).
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free (21.77 GiB allocated by PyTorch, 2.12 MiB reserved but unallocated).
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free (21.77 GiB allocated by PyTorch, 2.12 MiB reserved but unallocated).
moe/activation_test.py:92: OutOfMemoryError
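Most OutOfMemoryError examples above are secondary failures: once the first examples fill the 22.07 GiB card, roughly 21.7 GiB stays allocated and even 20.00 MiB requests fail. Hypothesis runs all examples inside a single setUp/tearDown cycle, so cleanup has to happen in the test body itself; a sketch of a hypothetical helper (free_cuda_between_examples is not part of activation_test.py):

import gc

import torch

def free_cuda_between_examples() -> None:
    # Drop references left over from the previous Hypothesis example,
    # then return cached CUDA blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()

# Hypothetical first statement of the test body, before torch.randn:
#     free_cuda_between_examples()

empty_cache() cannot reclaim tensors that are still referenced, but it keeps one failed example's cache from starving the next one.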
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.2415272Z 2025-05-07T20:32:16.2415391Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.2415602Z 2025-05-07T20:32:16.2500818Z FAILED 2025-05-07T20:32:16.2501124Z 2025-05-07T20:32:16.2501334Z =================================== FAILURES =================================== 2025-05-07T20:32:16.2501810Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:16.2502361Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:16.2503218Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:16.2503798Z | yield 2025-05-07T20:32:16.2504241Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:16.2504761Z | self._callTestMethod(testMethod) 2025-05-07T20:32:16.2505332Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:16.2505878Z | if method() is not None: 2025-05-07T20:32:16.2506127Z | ^^^^^^^^ 2025-05-07T20:32:16.2506757Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:16.2507836Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.2508141Z | ^^^^^^^ 2025-05-07T20:32:16.2508700Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:16.2509306Z | raise the_error_hypothesis_found 2025-05-07T20:32:16.2509724Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:16.2510261Z +-+---------------- 1 ---------------- 2025-05-07T20:32:16.2510546Z | Traceback (most recent call last): 2025-05-07T20:32:16.2511232Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:16.2511997Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.2512381Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:16.2514964Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
  |     if method() is not None:
  |        ^^^^^^^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (diagnostics identical to the message above)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (diagnostics identical to the message above)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (diagnostics identical to the message above)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ^^^^^^^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
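Each sub-exception above ends with a @reproduce_failure blob. To replay, say, failure 1 locally, the decorator goes directly on the test while the strategies stay untouched. A minimal sketch (standalone function instead of the TestCase method, body elided; the blob is pinned to Hypothesis 6.131.14):

    from typing import Optional

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")  # blob printed for failure 1
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # original test body goes here unchanged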
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
(same fbgemm_gpu and triton frames as failure 4 above, from fp8_gemm.py:2370 down to compiler.py:273)
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
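Both kernels fail at the same point: Triton refuses to lower the fp8e4nv type (float8_e4m3fn) on this GPU. The g5 runner carries an NVIDIA A10G, an sm_86 part, and Triton's fp8e4nv conversions generally require compute capability 8.9 or newer, hence the error offering only fp8e4b15 and fp8e5. One way to keep such runners green is to gate the FP8 tests on device capability; a sketch under those assumptions, not code from this repo:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (float8_e4m3fn) lowering needs sm_89 or newer.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "FP8 e4m3 kernels need sm_89 or newer")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant and friends unchanged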
The shrink phase then retries seven more parameter combinations, and every one of them reaches this same CompilationError; the traces are identical to the one above except for the parameters and for which Triton kernel is being compiled:

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant (moe/activation.py:80)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row (fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
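All of these failures happen inside Triton kernels, either the op under test (_fbgemm_silu_mul_quant) or the reference path (_kernel_quantize_fp8_row), so on hardware like this a Triton-free reference is one way to keep the comparison meaningful. A plain-PyTorch sketch of rowwise FP8 quantization follows, under the assumption that triton_quantize_fp8_row scales each row so its absolute max maps onto the e4m3 maximum, optionally capped by scale_ub; the real kernel may handle edge cases differently, and the helper name is hypothetical:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

    def rowwise_quantize_fp8_ref(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Hypothetical stand-in for triton_quantize_fp8_row, shaped to match
        # how the test dequantizes: y ~= y_fp8.to(torch.float32) * y_scale[:, None].
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)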
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)

T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

[test source identical to the first example above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
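Every one of these failures is the same compile-time error, not a numerical mismatch: Triton's fp8e4nv is the e4m3 FP8 format, and on this runner's GPU the NVIDIA backend refuses to lower it, so both the FBGEMM kernel (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) die in ast_to_ttir before anything launches. For orientation, here is a rough PyTorch-only sketch of the computation under test, assuming triton_quantize_fp8_row performs standard rowwise max-abs scaling; rowwise_quant_sketch is illustrative, not FBGEMM's API, and this runs on CPU with no Triton involved:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3 ("fp8e4nv")

    def rowwise_quant_sketch(y, scale_ub=None):
        # One scale per row: max-abs, optionally clamped to scale_ub, over FP8_MAX.
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = row_max.clamp(max=scale_ub)
        y_scale = row_max / FP8_MAX
        return (y / y_scale[:, None]).to(torch.float8_e4m3fn), y_scale

    T, D = 128, 7168
    x = torch.randn(T, 2 * D, dtype=torch.bfloat16)
    x0, x1 = x[:, :D].float(), x[:, D:].float()
    y = x0 * torch.sigmoid(x0) * x1                       # SiLU(x0) * x1, as in ref_fn
    y_fp8, y_scale = rowwise_quant_sketch(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]   # dequantize, as the test does
    print((y - y_back).abs().max())                       # rowwise FP8 roundtrip error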
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

[test source and traceback identical: fails at moe/activation_test.py:117 in fn, compiling _fbgemm_silu_mul_quant]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
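The architecture is the deciding factor, which is why no combination of T, D, scale_ub, contiguity, or compilation mode changes the outcome: fp8e4nv lowering in Triton generally requires sm_89 or newer (Ada/Hopper), while the A10G in a linux.g5.4xlarge runner reports sm_86, where Triton offers only fp8e4b15 and fp8e5, exactly as the error message says. A quick probe; the (8, 9) threshold is an assumption to verify against the Triton release in use:

    import torch

    # fp8e4nv (e4m3) lowering generally needs sm_89+ on NVIDIA GPUs; the A10G
    # in a g5.4xlarge is sm_86, so this is expected to print False there.
    major, minor = torch.cuda.get_device_capability()
    print(torch.cuda.get_device_name(), f"sm_{major}{minor}",
          "fp8e4nv supported:", (major, minor) >= (8, 9))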
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

[all five fail identically at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
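Since every Hypothesis example hits the same wall, the suite could skip up front instead of burning through _MAX_SAMPLES identical CompilationErrors. A sketch of such a gate; the helper and class names are hypothetical, not what moe/activation_test.py actually declares:

    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        # Gate on compute capability rather than letting Triton fail in ast_to_ttir.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_has_fp8e4nv(), "fp8e4nv (e4m3) needs sm_89+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):  # illustrative name, not the real class
        ...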
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

[test source identical; this time the failure surfaces inside the torch.compile wrapper]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
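The eval_frame.py frame above shows the compiled variant simply re-entering the eager silu_mul_quant, so torch.compile offers no route around the unsupported dtype here. A minimal sketch that should reproduce the same ValueError on this GPU, assuming recent Triton's mapping of torch.float8_e4m3fn to fp8e4nv:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The .to(tl.float8e4nv) below is what trips ast_to_ttir on sm_86.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)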
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)

[fails identically at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)

[fails identically at moe/activation_test.py:117 in fn, compiling _fbgemm_silu_mul_quant]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3058927Z 2025-05-07T20:32:16.3059613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3059622Z 2025-05-07T20:32:16.3059743Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3059976Z self=, 2025-05-07T20:32:16.3060213Z T=128, 2025-05-07T20:32:16.3060292Z D=5120, 2025-05-07T20:32:16.3060384Z scale_ub=None, 2025-05-07T20:32:16.3060473Z contiguous=False, 2025-05-07T20:32:16.3060563Z compiled=True, 2025-05-07T20:32:16.3060636Z ) 2025-05-07T20:32:16.3060854Z self = 2025-05-07T20:32:16.3061026Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3061030Z 2025-05-07T20:32:16.3061109Z @given( 2025-05-07T20:32:16.3061231Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3061336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3061449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3061565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3061684Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3061757Z ) 2025-05-07T20:32:16.3062004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3062105Z def test_silu_mul_quant( 2025-05-07T20:32:16.3062182Z self, 2025-05-07T20:32:16.3062265Z T: int, 2025-05-07T20:32:16.3062342Z D: int, 2025-05-07T20:32:16.3062439Z scale_ub: Optional[float], 2025-05-07T20:32:16.3062535Z contiguous: bool, 2025-05-07T20:32:16.3062622Z compiled: bool, 2025-05-07T20:32:16.3062702Z ) -> None: 2025-05-07T20:32:16.3062803Z torch.manual_seed(2025) 2025-05-07T20:32:16.3062878Z 2025-05-07T20:32:16.3063048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3063128Z 2025-05-07T20:32:16.3063223Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3063356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3063447Z x = x_sign * x_clamp 2025-05-07T20:32:16.3063528Z x0 = x[:, :D] 2025-05-07T20:32:16.3063615Z x1 = x[:, D:] 2025-05-07T20:32:16.3063690Z 2025-05-07T20:32:16.3063780Z if contiguous: 2025-05-07T20:32:16.3063876Z x0 = x0.contiguous() 2025-05-07T20:32:16.3063965Z x1 = x1.contiguous() 2025-05-07T20:32:16.3064040Z 2025-05-07T20:32:16.3064135Z if scale_ub is not None: 2025-05-07T20:32:16.3064239Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3064374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3064458Z ) 2025-05-07T20:32:16.3064536Z else: 2025-05-07T20:32:16.3064635Z scale_ub_tensor = None 2025-05-07T20:32:16.3064713Z 2025-05-07T20:32:16.3064842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3064937Z op = silu_mul_quant 2025-05-07T20:32:16.3065024Z if compiled: 2025-05-07T20:32:16.3065122Z op = torch.compile(op) 2025-05-07T20:32:16.3065233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3065309Z 2025-05-07T20:32:16.3065400Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3065408Z 2025-05-07T20:32:16.3065512Z moe/activation_test.py:117: 2025-05-07T20:32:16.3065639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3065745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3065845Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3066207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3066306Z return fn(*args, **kwargs) 
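
Every example in this run dies in the same place: Triton refuses to lower the fp8e4nv encoding (torch.float8_e4m3fn) because it is only implemented for NVIDIA compute capability 8.9 and newer (Ada/Hopper); on older GPUs Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal guard that would skip the whole parameter grid up front is sketched below; the helper name and skipIf placement are illustrative assumptions, not the suite's actual gating logic.

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (E4M3) only on compute capability >= (8, 9);
    # pre-Ada GPUs are limited to the fp8e4b15 and fp8e5 encodings.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical decorator for tests that compile fp8e4nv kernels.
fp8e4nv_only = unittest.skipIf(
    not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)"
)

Applied as @fp8e4nv_only on test_silu_mul_quant, the run would record one clean skip on this machine instead of failing example by example.
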
2025-05-07T20:32:16.3066925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3067029Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3067385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3067605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3068017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3068110Z kernel = self.compile( 2025-05-07T20:32:16.3068484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3068659Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3068784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3068794Z 2025-05-07T20:32:16.3068996Z self = 2025-05-07T20:32:16.3069763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3070259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43a7df80>} 2025-05-07T20:32:16.3071001Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3071192Z context = 2025-05-07T20:32:16.3071196Z 2025-05-07T20:32:16.3071367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3071620Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3071726Z module_map=module_map) 2025-05-07T20:32:16.3071891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3071990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3072070Z E ^ 2025-05-07T20:32:16.3072429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3072433Z 2025-05-07T20:32:16.3072839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3072843Z 2025-05-07T20:32:16.3072949Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3073170Z self=, 2025-05-07T20:32:16.3073250Z T=128, 2025-05-07T20:32:16.3073337Z D=7168, 2025-05-07T20:32:16.3073422Z scale_ub=1200.0, 2025-05-07T20:32:16.3073509Z contiguous=False, 2025-05-07T20:32:16.3073600Z compiled=False, 2025-05-07T20:32:16.3073676Z ) 2025-05-07T20:32:16.3073895Z self = 2025-05-07T20:32:16.3074067Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3074077Z 2025-05-07T20:32:16.3074152Z @given( 2025-05-07T20:32:16.3074274Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3074374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3074488Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3074611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3074728Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3074804Z ) 2025-05-07T20:32:16.3075052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3075252Z def test_silu_mul_quant( 2025-05-07T20:32:16.3075335Z self, 2025-05-07T20:32:16.3075412Z T: int, 2025-05-07T20:32:16.3075488Z D: int, 2025-05-07T20:32:16.3075589Z scale_ub: Optional[float], 2025-05-07T20:32:16.3075681Z contiguous: bool, 2025-05-07T20:32:16.3075766Z compiled: bool, 2025-05-07T20:32:16.3075846Z ) -> None: 2025-05-07T20:32:16.3076019Z torch.manual_seed(2025) 2025-05-07T20:32:16.3076093Z 2025-05-07T20:32:16.3076271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3076342Z 2025-05-07T20:32:16.3076431Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3076560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3076650Z x = x_sign * x_clamp 2025-05-07T20:32:16.3076740Z x0 = x[:, :D] 2025-05-07T20:32:16.3076821Z x1 = x[:, D:] 2025-05-07T20:32:16.3076893Z 2025-05-07T20:32:16.3076985Z if contiguous: 2025-05-07T20:32:16.3077076Z x0 = x0.contiguous() 2025-05-07T20:32:16.3077165Z x1 = x1.contiguous() 2025-05-07T20:32:16.3077244Z 2025-05-07T20:32:16.3077334Z if scale_ub is not None: 2025-05-07T20:32:16.3077437Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3077576Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3077652Z ) 2025-05-07T20:32:16.3077730Z else: 2025-05-07T20:32:16.3077830Z scale_ub_tensor = None 2025-05-07T20:32:16.3077904Z 2025-05-07T20:32:16.3078043Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3078143Z op = silu_mul_quant 2025-05-07T20:32:16.3078238Z if compiled: 2025-05-07T20:32:16.3078357Z op = torch.compile(op) 2025-05-07T20:32:16.3078468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3078541Z 2025-05-07T20:32:16.3078639Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3078648Z 2025-05-07T20:32:16.3078744Z moe/activation_test.py:117: 2025-05-07T20:32:16.3078871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3078976Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3079075Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3079571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3079673Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3080026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3080249Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3080583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3080677Z kernel = self.compile( 2025-05-07T20:32:16.3081062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3081236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3081372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3081376Z 2025-05-07T20:32:16.3081579Z self = 2025-05-07T20:32:16.3082342Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3082846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43450360>} 2025-05-07T20:32:16.3083657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3083855Z context = 2025-05-07T20:32:16.3083859Z 2025-05-07T20:32:16.3084022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3084278Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3084509Z module_map=module_map) 2025-05-07T20:32:16.3084669Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3084772Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3084847Z E ^ 2025-05-07T20:32:16.3085199Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3085204Z 2025-05-07T20:32:16.3085617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3085622Z 2025-05-07T20:32:16.3085724Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3085950Z self=, 2025-05-07T20:32:16.3086024Z T=128, 2025-05-07T20:32:16.3086099Z D=5120, 2025-05-07T20:32:16.3086183Z scale_ub=None, 2025-05-07T20:32:16.3086277Z contiguous=False, 2025-05-07T20:32:16.3086358Z compiled=False, 2025-05-07T20:32:16.3086433Z ) 2025-05-07T20:32:16.3086648Z self = 2025-05-07T20:32:16.3086815Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.3086828Z 2025-05-07T20:32:16.3086904Z @given( 2025-05-07T20:32:16.3087022Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3087129Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3087251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3087365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3087483Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3087557Z ) 2025-05-07T20:32:16.3087797Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3087895Z def test_silu_mul_quant( 2025-05-07T20:32:16.3087972Z self, 2025-05-07T20:32:16.3088056Z T: int, 2025-05-07T20:32:16.3088139Z D: int, 2025-05-07T20:32:16.3088237Z scale_ub: Optional[float], 2025-05-07T20:32:16.3088332Z contiguous: bool, 2025-05-07T20:32:16.3088417Z compiled: bool, 2025-05-07T20:32:16.3088493Z ) -> None: 2025-05-07T20:32:16.3088594Z torch.manual_seed(2025) 2025-05-07T20:32:16.3088666Z 2025-05-07T20:32:16.3088833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3088909Z 2025-05-07T20:32:16.3089006Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3089130Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3089227Z x = x_sign * x_clamp 2025-05-07T20:32:16.3089307Z x0 = x[:, :D] 2025-05-07T20:32:16.3089387Z x1 = x[:, D:] 2025-05-07T20:32:16.3089463Z 2025-05-07T20:32:16.3089546Z if contiguous: 2025-05-07T20:32:16.3089637Z x0 = x0.contiguous() 2025-05-07T20:32:16.3089734Z x1 = x1.contiguous() 2025-05-07T20:32:16.3089802Z 2025-05-07T20:32:16.3089895Z if scale_ub is not None: 2025-05-07T20:32:16.3090003Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3090139Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3090222Z ) 2025-05-07T20:32:16.3090297Z else: 2025-05-07T20:32:16.3090392Z scale_ub_tensor = None 2025-05-07T20:32:16.3090465Z 2025-05-07T20:32:16.3090594Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3090766Z op = silu_mul_quant 2025-05-07T20:32:16.3090855Z if compiled: 2025-05-07T20:32:16.3090954Z op = torch.compile(op) 2025-05-07T20:32:16.3091057Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3091135Z 2025-05-07T20:32:16.3091227Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3091231Z 2025-05-07T20:32:16.3091327Z moe/activation_test.py:117: 2025-05-07T20:32:16.3091534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3091635Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3091736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3092225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3092323Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3092685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3092904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3093293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3093386Z kernel = self.compile( 2025-05-07T20:32:16.3093763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3093945Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3094075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3094079Z 2025-05-07T20:32:16.3094291Z self = 2025-05-07T20:32:16.3095054Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3095550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a434514e0>} 2025-05-07T20:32:16.3096284Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3096478Z context = 2025-05-07T20:32:16.3096483Z 2025-05-07T20:32:16.3096650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3096905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3097012Z module_map=module_map) 2025-05-07T20:32:16.3097176Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3097280Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3097364Z E ^ 2025-05-07T20:32:16.3097710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3097714Z 2025-05-07T20:32:16.3098116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3098125Z 2025-05-07T20:32:16.3098230Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3098447Z self=, 2025-05-07T20:32:16.3098527Z T=128, 2025-05-07T20:32:16.3098607Z D=5120, 2025-05-07T20:32:16.3098689Z scale_ub=1200.0, 2025-05-07T20:32:16.3098778Z contiguous=True, 2025-05-07T20:32:16.3098861Z compiled=False, 2025-05-07T20:32:16.3098934Z ) 2025-05-07T20:32:16.3099154Z self = 2025-05-07T20:32:16.3099403Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3099409Z 2025-05-07T20:32:16.3099483Z @given( 2025-05-07T20:32:16.3099608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3099705Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3099823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3099945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3100159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3100238Z ) 2025-05-07T20:32:16.3100482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3100572Z def test_silu_mul_quant( 2025-05-07T20:32:16.3100653Z self, 2025-05-07T20:32:16.3100730Z T: int, 2025-05-07T20:32:16.3100804Z D: int, 2025-05-07T20:32:16.3100907Z scale_ub: Optional[float], 2025-05-07T20:32:16.3100996Z contiguous: bool, 2025-05-07T20:32:16.3101086Z compiled: bool, 2025-05-07T20:32:16.3101166Z ) -> None: 2025-05-07T20:32:16.3101261Z torch.manual_seed(2025) 2025-05-07T20:32:16.3101333Z 2025-05-07T20:32:16.3101566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3105802Z 2025-05-07T20:32:16.3105904Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3106031Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3106133Z x = x_sign * x_clamp 2025-05-07T20:32:16.3106211Z x0 = x[:, :D] 2025-05-07T20:32:16.3106295Z x1 = x[:, D:] 2025-05-07T20:32:16.3106366Z 2025-05-07T20:32:16.3106446Z if contiguous: 2025-05-07T20:32:16.3106541Z x0 = x0.contiguous() 2025-05-07T20:32:16.3106628Z x1 = x1.contiguous() 2025-05-07T20:32:16.3106700Z 2025-05-07T20:32:16.3106797Z if scale_ub is not None: 2025-05-07T20:32:16.3106901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3107039Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3107117Z ) 2025-05-07T20:32:16.3107193Z else: 2025-05-07T20:32:16.3107283Z scale_ub_tensor = None 2025-05-07T20:32:16.3107360Z 2025-05-07T20:32:16.3107493Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3107586Z op = silu_mul_quant 2025-05-07T20:32:16.3107669Z if compiled: 2025-05-07T20:32:16.3107770Z op = torch.compile(op) 2025-05-07T20:32:16.3107879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3107949Z 2025-05-07T20:32:16.3108040Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3108047Z 2025-05-07T20:32:16.3108167Z moe/activation_test.py:117: 2025-05-07T20:32:16.3108315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3108415Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3108516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3109016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3109118Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3109474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3109692Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3110037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3110131Z kernel = self.compile( 2025-05-07T20:32:16.3110506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3110685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3110809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3110917Z 2025-05-07T20:32:16.3111126Z self = 2025-05-07T20:32:16.3111885Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3112385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43453560>} 2025-05-07T20:32:16.3113190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3113377Z context = 2025-05-07T20:32:16.3113381Z 2025-05-07T20:32:16.3113551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3113810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3113918Z module_map=module_map) 2025-05-07T20:32:16.3114076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3114175Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3114261Z E ^ 2025-05-07T20:32:16.3114613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3114618Z 2025-05-07T20:32:16.3115023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3115031Z 2025-05-07T20:32:16.3115130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3115345Z self=, 2025-05-07T20:32:16.3115424Z T=1, 2025-05-07T20:32:16.3115500Z D=7168, 2025-05-07T20:32:16.3115583Z scale_ub=1200.0, 2025-05-07T20:32:16.3115674Z contiguous=True, 2025-05-07T20:32:16.3115756Z compiled=True, 2025-05-07T20:32:16.3115832Z ) 2025-05-07T20:32:16.3116048Z self = 2025-05-07T20:32:16.3116207Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3116217Z 2025-05-07T20:32:16.3116295Z @given( 2025-05-07T20:32:16.3116419Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3116515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3116635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3116749Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3116860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3116938Z ) 2025-05-07T20:32:16.3117182Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3117271Z def test_silu_mul_quant( 2025-05-07T20:32:16.3117343Z self, 2025-05-07T20:32:16.3117417Z T: int, 2025-05-07T20:32:16.3117492Z D: int, 2025-05-07T20:32:16.3117591Z scale_ub: Optional[float], 2025-05-07T20:32:16.3117678Z contiguous: bool, 2025-05-07T20:32:16.3117766Z compiled: bool, 2025-05-07T20:32:16.3117844Z ) -> None: 2025-05-07T20:32:16.3117940Z torch.manual_seed(2025) 2025-05-07T20:32:16.3118021Z 2025-05-07T20:32:16.3118209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3118283Z 2025-05-07T20:32:16.3118397Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3118524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3118611Z x = x_sign * x_clamp 2025-05-07T20:32:16.3118691Z x0 = x[:, :D] 2025-05-07T20:32:16.3118768Z x1 = x[:, D:] 2025-05-07T20:32:16.3118840Z 2025-05-07T20:32:16.3119007Z if contiguous: 2025-05-07T20:32:16.3119100Z x0 = x0.contiguous() 2025-05-07T20:32:16.3119187Z x1 = x1.contiguous() 2025-05-07T20:32:16.3119258Z 2025-05-07T20:32:16.3119345Z if scale_ub is not None: 2025-05-07T20:32:16.3119449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3119580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3119652Z ) 2025-05-07T20:32:16.3119807Z else: 2025-05-07T20:32:16.3119899Z scale_ub_tensor = None 2025-05-07T20:32:16.3119970Z 2025-05-07T20:32:16.3120102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3120190Z op = silu_mul_quant 2025-05-07T20:32:16.3120269Z if compiled: 2025-05-07T20:32:16.3120370Z op = torch.compile(op) 2025-05-07T20:32:16.3120470Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3120540Z 2025-05-07T20:32:16.3120632Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3120642Z 2025-05-07T20:32:16.3120735Z moe/activation_test.py:117: 2025-05-07T20:32:16.3120866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3120963Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3121058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3121423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3121516Z return fn(*args, **kwargs) 
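
The reference path's triton_quantize_fp8_row hits the identical error, since it also emits fp8e4nv. Its row-wise contract can be emulated in eager float32, which is useful for checking values on hardware without fp8 support. A minimal sketch, assuming the returned scale is the dequantization scale (the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]) and that scale_ub caps the per-row maximum; FP8_MAX is float8_e4m3fn's largest finite value:

from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # largest finite float8_e4m3fn value

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=-1)  # per-row absolute maximum
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed upper bound
    scale = (row_max / FP8_MAX).clamp(min=1e-12)  # dequantization scale
    y_q = (y / scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    return y_q.to(torch.float8_e4m3fn), scale

The test's input clamp to magnitudes in [0.01, 2.0] keeps row_max strictly positive, so the per-row scale never degenerates.
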
2025-05-07T20:32:16.3122000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3122099Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3122445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3122665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3122998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3123088Z kernel = self.compile( 2025-05-07T20:32:16.3123461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3123628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3123759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3123764Z 2025-05-07T20:32:16.3123963Z self = 2025-05-07T20:32:16.3124719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3125219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a42ce42c0>} 2025-05-07T20:32:16.3125953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3126140Z context = 2025-05-07T20:32:16.3126149Z 2025-05-07T20:32:16.3126308Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3126562Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3126668Z module_map=module_map) 2025-05-07T20:32:16.3126830Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3126928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3127004Z E ^ 2025-05-07T20:32:16.3127431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3127436Z 2025-05-07T20:32:16.3127844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3127849Z 2025-05-07T20:32:16.3127946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3128171Z self=, 2025-05-07T20:32:16.3128342Z T=1, 2025-05-07T20:32:16.3128420Z D=7168, 2025-05-07T20:32:16.3128520Z scale_ub=1200.0, 2025-05-07T20:32:16.3128607Z contiguous=False, 2025-05-07T20:32:16.3128688Z compiled=True, 2025-05-07T20:32:16.3128761Z ) 2025-05-07T20:32:16.3128971Z self = 2025-05-07T20:32:16.3129131Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3129135Z 2025-05-07T20:32:16.3129215Z @given( 2025-05-07T20:32:16.3129331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3129433Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3129543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3129656Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3129766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3129837Z ) 2025-05-07T20:32:16.3130082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3130175Z def test_silu_mul_quant( 2025-05-07T20:32:16.3130246Z self, 2025-05-07T20:32:16.3130322Z T: int, 2025-05-07T20:32:16.3130400Z D: int, 2025-05-07T20:32:16.3130495Z scale_ub: Optional[float], 2025-05-07T20:32:16.3130579Z contiguous: bool, 2025-05-07T20:32:16.3130662Z compiled: bool, 2025-05-07T20:32:16.3130738Z ) -> None: 2025-05-07T20:32:16.3130834Z torch.manual_seed(2025) 2025-05-07T20:32:16.3130910Z 2025-05-07T20:32:16.3131074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3131149Z 2025-05-07T20:32:16.3131236Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3131356Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3131445Z x = x_sign * x_clamp 2025-05-07T20:32:16.3131522Z x0 = x[:, :D] 2025-05-07T20:32:16.3131604Z x1 = x[:, D:] 2025-05-07T20:32:16.3131678Z 2025-05-07T20:32:16.3131757Z if contiguous: 2025-05-07T20:32:16.3131848Z x0 = x0.contiguous() 2025-05-07T20:32:16.3131938Z x1 = x1.contiguous() 2025-05-07T20:32:16.3132008Z 2025-05-07T20:32:16.3132093Z if scale_ub is not None: 2025-05-07T20:32:16.3132197Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3132327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3132410Z ) 2025-05-07T20:32:16.3132483Z else: 2025-05-07T20:32:16.3132579Z scale_ub_tensor = None 2025-05-07T20:32:16.3132650Z 2025-05-07T20:32:16.3132776Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3132863Z op = silu_mul_quant 2025-05-07T20:32:16.3132949Z if compiled: 2025-05-07T20:32:16.3133144Z op = torch.compile(op) 2025-05-07T20:32:16.3133247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3133323Z 2025-05-07T20:32:16.3133409Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3133414Z 2025-05-07T20:32:16.3133510Z moe/activation_test.py:117: 2025-05-07T20:32:16.3133637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3133733Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3133830Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3134191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3134388Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3134876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3134971Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3135322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3135618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3135947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3136044Z kernel = self.compile( 2025-05-07T20:32:16.3136417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3136588Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3136720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3136725Z 2025-05-07T20:32:16.3136924Z self = 2025-05-07T20:32:16.3137682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3138182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a42ce5120>} 2025-05-07T20:32:16.3138910Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3139095Z context = 2025-05-07T20:32:16.3139103Z 2025-05-07T20:32:16.3139263Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3139520Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3139625Z module_map=module_map) 2025-05-07T20:32:16.3139779Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3139879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3139955Z E ^ 2025-05-07T20:32:16.3140303Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3140307Z 2025-05-07T20:32:16.3140709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3140714Z 2025-05-07T20:32:16.3140812Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3141034Z self=, 2025-05-07T20:32:16.3141110Z T=1, 2025-05-07T20:32:16.3141187Z D=7168, 2025-05-07T20:32:16.3141267Z scale_ub=None, 2025-05-07T20:32:16.3141350Z contiguous=False, 2025-05-07T20:32:16.3141432Z compiled=True, 2025-05-07T20:32:16.3141502Z ) 2025-05-07T20:32:16.3141714Z self = 2025-05-07T20:32:16.3141878Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3141887Z 2025-05-07T20:32:16.3141960Z @given( 2025-05-07T20:32:16.3142075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3142173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3142283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3142404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3142514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3142585Z ) 2025-05-07T20:32:16.3142908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3143005Z def test_silu_mul_quant( 2025-05-07T20:32:16.3143081Z self, 2025-05-07T20:32:16.3143159Z T: int, 2025-05-07T20:32:16.3143236Z D: int, 2025-05-07T20:32:16.3143336Z scale_ub: Optional[float], 2025-05-07T20:32:16.3143431Z contiguous: bool, 2025-05-07T20:32:16.3143518Z compiled: bool, 2025-05-07T20:32:16.3143669Z ) -> None: 2025-05-07T20:32:16.3143769Z torch.manual_seed(2025) 2025-05-07T20:32:16.3143839Z 2025-05-07T20:32:16.3144021Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3144098Z 2025-05-07T20:32:16.3144191Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3144326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3144415Z x = x_sign * x_clamp 2025-05-07T20:32:16.3144494Z x0 = x[:, :D] 2025-05-07T20:32:16.3144578Z x1 = x[:, D:] 2025-05-07T20:32:16.3144655Z 2025-05-07T20:32:16.3144739Z if contiguous: 2025-05-07T20:32:16.3144837Z x0 = x0.contiguous() 2025-05-07T20:32:16.3144927Z x1 = x1.contiguous() 2025-05-07T20:32:16.3145001Z 2025-05-07T20:32:16.3145097Z if scale_ub is not None: 2025-05-07T20:32:16.3145205Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3145347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3145430Z ) 2025-05-07T20:32:16.3145506Z else: 2025-05-07T20:32:16.3145603Z scale_ub_tensor = None 2025-05-07T20:32:16.3145675Z 2025-05-07T20:32:16.3145808Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3145900Z op = silu_mul_quant 2025-05-07T20:32:16.3145983Z if compiled: 2025-05-07T20:32:16.3146085Z op = torch.compile(op) 2025-05-07T20:32:16.3146196Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3146266Z 2025-05-07T20:32:16.3146365Z y_fp8, y_scale = fn() 2025-05-07T20:32:16.3146498Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:16.3146568Z 2025-05-07T20:32:16.3146711Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3146819Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:16.3146921Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:16.3147050Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:16.3147202Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.3147274Z 2025-05-07T20:32:16.3147378Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:16.3147382Z 2025-05-07T20:32:16.3147486Z moe/activation_test.py:126: 2025-05-07T20:32:16.3147622Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3147734Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:16.3147879Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:16.3148611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:16.3148714Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:16.3149136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3149390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3149825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:16.3150121Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:16.3150568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:16.3150750Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:16.3151292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:16.3151369Z fn() 2025-05-07T20:32:16.3151759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:16.3151841Z self.fn.run( 2025-05-07T20:32:16.3152170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3152338Z kernel = self.compile( 2025-05-07T20:32:16.3152707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3152875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3153002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3153006Z 2025-05-07T20:32:16.3153209Z self = 2025-05-07T20:32:16.3153965Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3154462Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a438d7420>} 2025-05-07T20:32:16.3155193Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3155380Z context = 2025-05-07T20:32:16.3155385Z 2025-05-07T20:32:16.3155547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3155808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3155912Z module_map=module_map) 2025-05-07T20:32:16.3156070Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3156174Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:16.3156250Z E ^ 2025-05-07T20:32:16.3156593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3156601Z 2025-05-07T20:32:16.3157009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3157014Z 2025-05-07T20:32:16.3157113Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3157332Z self=, 2025-05-07T20:32:16.3157404Z T=1, 2025-05-07T20:32:16.3157478Z D=5120, 2025-05-07T20:32:16.3157562Z scale_ub=1200.0, 2025-05-07T20:32:16.3157649Z contiguous=False, 2025-05-07T20:32:16.3157730Z compiled=True, 2025-05-07T20:32:16.3157803Z ) 2025-05-07T20:32:16.3158016Z self = 2025-05-07T20:32:16.3158183Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3158187Z 2025-05-07T20:32:16.3158259Z @given( 2025-05-07T20:32:16.3158375Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3158477Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3158590Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3158704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3158816Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3158888Z ) 2025-05-07T20:32:16.3159126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3159448Z def test_silu_mul_quant( 2025-05-07T20:32:16.3159723Z self, 2025-05-07T20:32:16.3159809Z T: int, 2025-05-07T20:32:16.3159886Z D: int, 2025-05-07T20:32:16.3159981Z scale_ub: Optional[float], 2025-05-07T20:32:16.3160067Z contiguous: bool, 2025-05-07T20:32:16.3160148Z compiled: bool, 2025-05-07T20:32:16.3160220Z ) -> None: 2025-05-07T20:32:16.3160313Z torch.manual_seed(2025) 2025-05-07T20:32:16.3160383Z 2025-05-07T20:32:16.3160664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3160741Z 2025-05-07T20:32:16.3160829Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3160954Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3161045Z x = x_sign * x_clamp 2025-05-07T20:32:16.3161120Z x0 = x[:, :D] 2025-05-07T20:32:16.3161197Z x1 = x[:, D:] 2025-05-07T20:32:16.3161270Z 2025-05-07T20:32:16.3161350Z if contiguous: 2025-05-07T20:32:16.3161441Z x0 = x0.contiguous() 2025-05-07T20:32:16.3161533Z x1 = x1.contiguous() 2025-05-07T20:32:16.3161607Z 2025-05-07T20:32:16.3161699Z if scale_ub is not None: 2025-05-07T20:32:16.3161799Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3161930Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3162005Z ) 2025-05-07T20:32:16.3162078Z else: 2025-05-07T20:32:16.3162169Z scale_ub_tensor = None 2025-05-07T20:32:16.3162245Z 2025-05-07T20:32:16.3162370Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3162456Z op = silu_mul_quant 2025-05-07T20:32:16.3162540Z if compiled: 2025-05-07T20:32:16.3162635Z op = torch.compile(op) 2025-05-07T20:32:16.3162737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3162805Z 2025-05-07T20:32:16.3162894Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3162899Z 2025-05-07T20:32:16.3163002Z moe/activation_test.py:117: 2025-05-07T20:32:16.3163136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3163237Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3163335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3163698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3163787Z return fn(*args, **kwargs) 
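
Spelled out end to end, the semantics under test are a SiLU gate followed by that row-wise quantization; a two-line eager equivalent of silu_mul_quant, reusing the quantize_fp8_row_ref sketch above:

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in float32, matching the test's ref_fn, then quantize per row.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    return quantize_fp8_row_ref(y, scale_ub)
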
2025-05-07T20:32:16.3164275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3164372Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3164718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3164933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3165273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3165364Z kernel = self.compile( 2025-05-07T20:32:16.3165740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3165910Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3166031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3166041Z 2025-05-07T20:32:16.3166244Z self = 2025-05-07T20:32:16.3166997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3167600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43a7fa60>} 2025-05-07T20:32:16.3168329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3168514Z context = 2025-05-07T20:32:16.3168522Z 2025-05-07T20:32:16.3168683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3169011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3169123Z module_map=module_map) 2025-05-07T20:32:16.3169284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3169380Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3169461Z E ^ 2025-05-07T20:32:16.3169809Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3169814Z 2025-05-07T20:32:16.3170219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3170224Z 2025-05-07T20:32:16.3170321Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3170540Z self=, 2025-05-07T20:32:16.3170614Z T=1, 2025-05-07T20:32:16.3170693Z D=5120, 2025-05-07T20:32:16.3170772Z scale_ub=1200.0, 2025-05-07T20:32:16.3170858Z contiguous=False, 2025-05-07T20:32:16.3170942Z compiled=False, 2025-05-07T20:32:16.3171009Z ) 2025-05-07T20:32:16.3171222Z self = 2025-05-07T20:32:16.3171384Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3171389Z 2025-05-07T20:32:16.3171462Z @given( 2025-05-07T20:32:16.3171576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3171675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3171791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3171904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3172013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3172086Z ) 2025-05-07T20:32:16.3172325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3172422Z def test_silu_mul_quant( 2025-05-07T20:32:16.3172500Z self, 2025-05-07T20:32:16.3172572Z T: int, 2025-05-07T20:32:16.3172647Z D: int, 2025-05-07T20:32:16.3172742Z scale_ub: Optional[float], 2025-05-07T20:32:16.3172828Z contiguous: bool, 2025-05-07T20:32:16.3172911Z compiled: bool, 2025-05-07T20:32:16.3172985Z ) -> None: 2025-05-07T20:32:16.3173127Z torch.manual_seed(2025) 2025-05-07T20:32:16.3173203Z 2025-05-07T20:32:16.3173368Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3173439Z 2025-05-07T20:32:16.3173530Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3173653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3173737Z x = x_sign * x_clamp 2025-05-07T20:32:16.3173818Z x0 = x[:, :D] 2025-05-07T20:32:16.3173896Z x1 = x[:, D:] 2025-05-07T20:32:16.3173971Z 2025-05-07T20:32:16.3174050Z if contiguous: 2025-05-07T20:32:16.3174143Z x0 = x0.contiguous() 2025-05-07T20:32:16.3174231Z x1 = x1.contiguous() 2025-05-07T20:32:16.3174299Z 2025-05-07T20:32:16.3174387Z if scale_ub is not None: 2025-05-07T20:32:16.3174490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3174621Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3174693Z ) 2025-05-07T20:32:16.3174765Z else: 2025-05-07T20:32:16.3174855Z scale_ub_tensor = None 2025-05-07T20:32:16.3174925Z 2025-05-07T20:32:16.3175137Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3175226Z op = silu_mul_quant 2025-05-07T20:32:16.3175306Z if compiled: 2025-05-07T20:32:16.3175407Z op = torch.compile(op) 2025-05-07T20:32:16.3175509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3175582Z 2025-05-07T20:32:16.3175669Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3175749Z 2025-05-07T20:32:16.3175844Z moe/activation_test.py:117: 2025-05-07T20:32:16.3175971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3176067Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3176160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3176652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3176747Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3177104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3177322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3177653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3177747Z kernel = self.compile( 2025-05-07T20:32:16.3178125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3178318Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3178467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3178473Z 2025-05-07T20:32:16.3178671Z self = 2025-05-07T20:32:16.3179435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3179927Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43e0f6a0>} 2025-05-07T20:32:16.3180657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3180847Z context = 2025-05-07T20:32:16.3180852Z 2025-05-07T20:32:16.3181010Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3181268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3181377Z module_map=module_map) 2025-05-07T20:32:16.3181537Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3181632Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3181705Z E ^ 2025-05-07T20:32:16.3182054Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3332709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3332926Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3333304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3333397Z kernel = self.compile( 2025-05-07T20:32:16.3333768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3333935Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3334058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3334066Z 2025-05-07T20:32:16.3334264Z self = 2025-05-07T20:32:16.3335021Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3335515Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a703bbc40>} 2025-05-07T20:32:16.3336243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3336429Z context = 2025-05-07T20:32:16.3336434Z 2025-05-07T20:32:16.3336684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3336945Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3337045Z module_map=module_map) 2025-05-07T20:32:16.3337201Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3337301Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3337373Z E ^ 2025-05-07T20:32:16.3337796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3337801Z 2025-05-07T20:32:16.3338208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3338214Z 2025-05-07T20:32:16.3338334Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3338577Z self=, 2025-05-07T20:32:16.3338650Z T=2048, 2025-05-07T20:32:16.3338726Z D=7168, 2025-05-07T20:32:16.3338805Z scale_ub=None, 2025-05-07T20:32:16.3338888Z contiguous=False, 2025-05-07T20:32:16.3338973Z compiled=True, 2025-05-07T20:32:16.3339043Z ) 2025-05-07T20:32:16.3339251Z self = 2025-05-07T20:32:16.3339419Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3339431Z 2025-05-07T20:32:16.3339503Z @given( 2025-05-07T20:32:16.3339619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3339719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3339829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3339942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3340056Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3340125Z ) 2025-05-07T20:32:16.3340371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3340462Z def test_silu_mul_quant( 2025-05-07T20:32:16.3340536Z self, 2025-05-07T20:32:16.3340610Z T: int, 2025-05-07T20:32:16.3340685Z D: int, 2025-05-07T20:32:16.3340781Z scale_ub: Optional[float], 2025-05-07T20:32:16.3340871Z contiguous: bool, 2025-05-07T20:32:16.3340951Z compiled: bool, 2025-05-07T20:32:16.3341024Z ) -> None: 2025-05-07T20:32:16.3341121Z torch.manual_seed(2025) 2025-05-07T20:32:16.3341191Z 2025-05-07T20:32:16.3341352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3341423Z 2025-05-07T20:32:16.3341511Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3341637Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3341721Z x = x_sign * x_clamp 2025-05-07T20:32:16.3341796Z x0 = x[:, :D] 2025-05-07T20:32:16.3341874Z x1 = x[:, D:] 2025-05-07T20:32:16.3341945Z 2025-05-07T20:32:16.3342026Z if contiguous: 2025-05-07T20:32:16.3342120Z x0 = x0.contiguous() 2025-05-07T20:32:16.3342205Z x1 = x1.contiguous() 2025-05-07T20:32:16.3342276Z 2025-05-07T20:32:16.3342364Z if scale_ub is not None: 2025-05-07T20:32:16.3342464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3342593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3342669Z ) 2025-05-07T20:32:16.3346656Z else: 2025-05-07T20:32:16.3346756Z scale_ub_tensor = None 2025-05-07T20:32:16.3346827Z 2025-05-07T20:32:16.3346960Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3347045Z op = silu_mul_quant 2025-05-07T20:32:16.3347126Z if compiled: 2025-05-07T20:32:16.3347227Z op = torch.compile(op) 2025-05-07T20:32:16.3347328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3347400Z 2025-05-07T20:32:16.3347487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3347617Z 2025-05-07T20:32:16.3347713Z moe/activation_test.py:117: 2025-05-07T20:32:16.3347845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3347942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3348039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3348454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3348622Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3349106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3349201Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3349549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3349768Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3350104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3350194Z kernel = self.compile( 2025-05-07T20:32:16.3350574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3350744Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3350878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3350883Z 2025-05-07T20:32:16.3351082Z self = 2025-05-07T20:32:16.3351839Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3352342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a9c629800>} 2025-05-07T20:32:16.3353071Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3353259Z context = 2025-05-07T20:32:16.3353268Z 2025-05-07T20:32:16.3353427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3353685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3353790Z module_map=module_map) 2025-05-07T20:32:16.3353950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3354048Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3354122Z E ^ 2025-05-07T20:32:16.3354470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3354474Z 2025-05-07T20:32:16.3354880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3354885Z 2025-05-07T20:32:16.3354983Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3355201Z self=, 2025-05-07T20:32:16.3355279Z T=4096, 2025-05-07T20:32:16.3355350Z D=7168, 2025-05-07T20:32:16.3355430Z scale_ub=None, 2025-05-07T20:32:16.3355511Z contiguous=False, 2025-05-07T20:32:16.3355588Z compiled=True, 2025-05-07T20:32:16.3355659Z ) 2025-05-07T20:32:16.3355870Z self = 2025-05-07T20:32:16.3356036Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3356044Z 2025-05-07T20:32:16.3356208Z @given( 2025-05-07T20:32:16.3356327Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3356427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3356536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3356646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3356761Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3356905Z ) 2025-05-07T20:32:16.3357143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3357240Z def test_silu_mul_quant( 2025-05-07T20:32:16.3357313Z self, 2025-05-07T20:32:16.3357388Z T: int, 2025-05-07T20:32:16.3357465Z D: int, 2025-05-07T20:32:16.3357559Z scale_ub: Optional[float], 2025-05-07T20:32:16.3357647Z contiguous: bool, 2025-05-07T20:32:16.3357733Z compiled: bool, 2025-05-07T20:32:16.3357805Z ) -> None: 2025-05-07T20:32:16.3357902Z torch.manual_seed(2025) 2025-05-07T20:32:16.3357980Z 2025-05-07T20:32:16.3358171Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3358265Z 2025-05-07T20:32:16.3358355Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3358477Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3358564Z x = x_sign * x_clamp 2025-05-07T20:32:16.3358643Z x0 = x[:, :D] 2025-05-07T20:32:16.3358727Z x1 = x[:, D:] 2025-05-07T20:32:16.3358801Z 2025-05-07T20:32:16.3358881Z if contiguous: 2025-05-07T20:32:16.3358970Z x0 = x0.contiguous() 2025-05-07T20:32:16.3359058Z x1 = x1.contiguous() 2025-05-07T20:32:16.3359128Z 2025-05-07T20:32:16.3359426Z if scale_ub is not None: 2025-05-07T20:32:16.3359581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3359751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3359836Z ) 2025-05-07T20:32:16.3359919Z else: 2025-05-07T20:32:16.3360010Z scale_ub_tensor = None 2025-05-07T20:32:16.3360084Z 2025-05-07T20:32:16.3360209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3360298Z op = silu_mul_quant 2025-05-07T20:32:16.3360383Z if compiled: 2025-05-07T20:32:16.3360479Z op = torch.compile(op) 2025-05-07T20:32:16.3360579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3360655Z 2025-05-07T20:32:16.3360741Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3360746Z 2025-05-07T20:32:16.3360841Z moe/activation_test.py:117: 2025-05-07T20:32:16.3360968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3361064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3361168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3361527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3361620Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3362107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3362200Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3362552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3362774Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3363102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3363196Z kernel = self.compile( 2025-05-07T20:32:16.3363568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3363741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3364010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3364016Z 2025-05-07T20:32:16.3364218Z self = 2025-05-07T20:32:16.3364977Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3365581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43953880>} 2025-05-07T20:32:16.3366313Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3366499Z context = 2025-05-07T20:32:16.3366508Z 2025-05-07T20:32:16.3366666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3366923Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3367025Z module_map=module_map) 2025-05-07T20:32:16.3367183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3367281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3367358Z E ^ 2025-05-07T20:32:16.3367704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3367709Z 2025-05-07T20:32:16.3368115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3368120Z 2025-05-07T20:32:16.3368248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3368493Z self=, 2025-05-07T20:32:16.3368570Z T=16384, 2025-05-07T20:32:16.3368650Z D=5120, 2025-05-07T20:32:16.3368730Z scale_ub=1200.0, 2025-05-07T20:32:16.3368814Z contiguous=False, 2025-05-07T20:32:16.3368898Z compiled=False, 2025-05-07T20:32:16.3368970Z ) 2025-05-07T20:32:16.3369183Z self = 2025-05-07T20:32:16.3369361Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3369371Z 2025-05-07T20:32:16.3369444Z @given( 2025-05-07T20:32:16.3369562Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3369657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3369767Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3369884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3369993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3370061Z ) 2025-05-07T20:32:16.3370308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3370397Z def test_silu_mul_quant( 2025-05-07T20:32:16.3370471Z self, 2025-05-07T20:32:16.3370546Z T: int, 2025-05-07T20:32:16.3370617Z D: int, 2025-05-07T20:32:16.3370712Z scale_ub: Optional[float], 2025-05-07T20:32:16.3370801Z contiguous: bool, 2025-05-07T20:32:16.3370884Z compiled: bool, 2025-05-07T20:32:16.3370971Z ) -> None: 2025-05-07T20:32:16.3371061Z torch.manual_seed(2025) 2025-05-07T20:32:16.3371129Z 2025-05-07T20:32:16.3371294Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3371365Z 2025-05-07T20:32:16.3371453Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3371576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3371660Z x = x_sign * x_clamp 2025-05-07T20:32:16.3371736Z x0 = x[:, :D] 2025-05-07T20:32:16.3371897Z x1 = x[:, D:] 2025-05-07T20:32:16.3371968Z 2025-05-07T20:32:16.3372048Z if contiguous: 2025-05-07T20:32:16.3372136Z x0 = x0.contiguous() 2025-05-07T20:32:16.3372221Z x1 = x1.contiguous() 2025-05-07T20:32:16.3372291Z 2025-05-07T20:32:16.3372387Z if scale_ub is not None: 2025-05-07T20:32:16.3372487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3372619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3372844Z ) 2025-05-07T20:32:16.3372918Z else: 2025-05-07T20:32:16.3373066Z scale_ub_tensor = None 2025-05-07T20:32:16.3373136Z 2025-05-07T20:32:16.3373262Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3373350Z op = silu_mul_quant 2025-05-07T20:32:16.3373430Z if compiled: 2025-05-07T20:32:16.3373526Z op = torch.compile(op) 2025-05-07T20:32:16.3373629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3373704Z 2025-05-07T20:32:16.3373789Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3373797Z 2025-05-07T20:32:16.3373889Z moe/activation_test.py:117: 2025-05-07T20:32:16.3374012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3374113Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3374206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3374695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.3374797Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3375149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3375371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3375703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3375795Z kernel = self.compile( 2025-05-07T20:32:16.3376171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3376339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3376461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3376470Z 2025-05-07T20:32:16.3376672Z self = 2025-05-07T20:32:16.3377426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3377920Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a43952700>} 2025-05-07T20:32:16.3378650Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3378839Z context = 2025-05-07T20:32:16.3378844Z 2025-05-07T20:32:16.3379004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3379262Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3379369Z module_map=module_map) 2025-05-07T20:32:16.3379524Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3379618Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3379694Z E ^ 2025-05-07T20:32:16.3380039Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3380151Z 2025-05-07T20:32:16.3380561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3380565Z 2025-05-07T20:32:16.3380663Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3380877Z self=, 2025-05-07T20:32:16.3380954Z T=16384, 2025-05-07T20:32:16.3381103Z D=5120, 2025-05-07T20:32:16.3381183Z scale_ub=1200.0, 2025-05-07T20:32:16.3381267Z contiguous=True, 2025-05-07T20:32:16.3381344Z compiled=True, 2025-05-07T20:32:16.3381417Z ) 2025-05-07T20:32:16.3381628Z self = 2025-05-07T20:32:16.3381796Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3381800Z 2025-05-07T20:32:16.3381875Z @given( 2025-05-07T20:32:16.3381990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3382092Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3382206Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3382318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3382426Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3382500Z ) 2025-05-07T20:32:16.3382739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3382842Z def test_silu_mul_quant( 2025-05-07T20:32:16.3382914Z self, 2025-05-07T20:32:16.3382988Z T: int, 2025-05-07T20:32:16.3383062Z D: int, 2025-05-07T20:32:16.3383155Z scale_ub: Optional[float], 2025-05-07T20:32:16.3383241Z contiguous: bool, 2025-05-07T20:32:16.3383326Z compiled: bool, 2025-05-07T20:32:16.3383402Z ) -> None: 2025-05-07T20:32:16.3383493Z torch.manual_seed(2025) 2025-05-07T20:32:16.3383567Z 2025-05-07T20:32:16.3383734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3383805Z 2025-05-07T20:32:16.3383900Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3384021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3384110Z x = x_sign * x_clamp 2025-05-07T20:32:16.3384185Z x0 = x[:, :D] 2025-05-07T20:32:16.3384260Z x1 = x[:, D:] 2025-05-07T20:32:16.3384329Z 2025-05-07T20:32:16.3384409Z if contiguous: 2025-05-07T20:32:16.3384502Z x0 = x0.contiguous() 2025-05-07T20:32:16.3384591Z x1 = x1.contiguous() 2025-05-07T20:32:16.3384661Z 2025-05-07T20:32:16.3384748Z if scale_ub is not None: 2025-05-07T20:32:16.3384852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3384980Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3385053Z ) 2025-05-07T20:32:16.3385125Z else: 2025-05-07T20:32:16.3385214Z scale_ub_tensor = None 2025-05-07T20:32:16.3385287Z 2025-05-07T20:32:16.3385416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3385502Z op = silu_mul_quant 2025-05-07T20:32:16.3385585Z if compiled: 2025-05-07T20:32:16.3385680Z op = torch.compile(op) 2025-05-07T20:32:16.3385779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3385851Z 2025-05-07T20:32:16.3385937Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3385949Z 2025-05-07T20:32:16.3386042Z moe/activation_test.py:117: 2025-05-07T20:32:16.3386169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3386264Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3386360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3386718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3386807Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3387371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3387466Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3387815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3388034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3388463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3388563Z kernel = self.compile( 2025-05-07T20:32:16.3388951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3389121Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3389245Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3389250Z 2025-05-07T20:32:16.3389456Z self = 2025-05-07T20:32:16.3390212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3390703Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a58561e40>} 2025-05-07T20:32:16.3391434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3391621Z context = 2025-05-07T20:32:16.3391626Z 2025-05-07T20:32:16.3391786Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3392045Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3392149Z module_map=module_map) 2025-05-07T20:32:16.3392304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3392403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3392477Z E ^ 2025-05-07T20:32:16.3392821Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3392833Z 2025-05-07T20:32:16.3393237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3393242Z 2025-05-07T20:32:16.3393340Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3393558Z self=, 2025-05-07T20:32:16.3393632Z T=16384, 2025-05-07T20:32:16.3393706Z D=5120, 2025-05-07T20:32:16.3393791Z scale_ub=None, 2025-05-07T20:32:16.3393873Z contiguous=False, 2025-05-07T20:32:16.3393951Z compiled=True, 2025-05-07T20:32:16.3394026Z ) 2025-05-07T20:32:16.3394236Z self = 2025-05-07T20:32:16.3394409Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3394413Z 2025-05-07T20:32:16.3394486Z @given( 2025-05-07T20:32:16.3394606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3394701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3394810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3394921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3395035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3395106Z ) 2025-05-07T20:32:16.3395347Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3395523Z def test_silu_mul_quant( 2025-05-07T20:32:16.3395596Z self, 2025-05-07T20:32:16.3395671Z T: int, 2025-05-07T20:32:16.3395743Z D: int, 2025-05-07T20:32:16.3395837Z scale_ub: Optional[float], 2025-05-07T20:32:16.3395924Z contiguous: bool, 2025-05-07T20:32:16.3396006Z compiled: bool, 2025-05-07T20:32:16.3396080Z ) -> None: 2025-05-07T20:32:16.3396174Z torch.manual_seed(2025) 2025-05-07T20:32:16.3396321Z 2025-05-07T20:32:16.3396483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3396555Z 2025-05-07T20:32:16.3396643Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3396764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3396849Z x = x_sign * x_clamp 2025-05-07T20:32:16.3396926Z x0 = x[:, :D] 2025-05-07T20:32:16.3397005Z x1 = x[:, D:] 2025-05-07T20:32:16.3397072Z 2025-05-07T20:32:16.3397149Z if contiguous: 2025-05-07T20:32:16.3397242Z x0 = x0.contiguous() 2025-05-07T20:32:16.3397326Z x1 = x1.contiguous() 2025-05-07T20:32:16.3397395Z 2025-05-07T20:32:16.3397484Z if scale_ub is not None: 2025-05-07T20:32:16.3397585Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3397714Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3397791Z ) 2025-05-07T20:32:16.3397866Z else: 2025-05-07T20:32:16.3397961Z scale_ub_tensor = None 2025-05-07T20:32:16.3398036Z 2025-05-07T20:32:16.3398159Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3398250Z op = silu_mul_quant 2025-05-07T20:32:16.3398332Z if compiled: 2025-05-07T20:32:16.3398447Z op = torch.compile(op) 2025-05-07T20:32:16.3398561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3398647Z 2025-05-07T20:32:16.3398735Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3398739Z 2025-05-07T20:32:16.3398839Z moe/activation_test.py:117: 2025-05-07T20:32:16.3398962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3399056Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3399151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3399509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3399604Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3400085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3400179Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3400530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3400745Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3401078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3401174Z kernel = self.compile( 2025-05-07T20:32:16.3401544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3401718Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3401839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3401848Z 2025-05-07T20:32:16.3402048Z self = 2025-05-07T20:32:16.3402805Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3403378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59321e40>} 2025-05-07T20:32:16.3404105Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3404290Z context = 2025-05-07T20:32:16.3404392Z 2025-05-07T20:32:16.3404558Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3404813Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3404916Z module_map=module_map) 2025-05-07T20:32:16.3405074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3405170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3405246Z E ^ 2025-05-07T20:32:16.3405599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3405604Z 2025-05-07T20:32:16.3406006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3406011Z 2025-05-07T20:32:16.3406114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3406331Z self=, 2025-05-07T20:32:16.3406421Z T=2048, 2025-05-07T20:32:16.3406492Z D=5120, 2025-05-07T20:32:16.3406571Z scale_ub=None, 2025-05-07T20:32:16.3406657Z contiguous=False, 2025-05-07T20:32:16.3406738Z compiled=True, 2025-05-07T20:32:16.3406814Z ) 2025-05-07T20:32:16.3407026Z self = 2025-05-07T20:32:16.3407191Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3407196Z 2025-05-07T20:32:16.3407271Z @given( 2025-05-07T20:32:16.3407394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3407490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3407605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3407717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3407827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3407897Z ) 2025-05-07T20:32:16.3408162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3408268Z def test_silu_mul_quant( 2025-05-07T20:32:16.3408357Z self, 2025-05-07T20:32:16.3408430Z T: int, 2025-05-07T20:32:16.3408504Z D: int, 2025-05-07T20:32:16.3408600Z scale_ub: Optional[float], 2025-05-07T20:32:16.3408685Z contiguous: bool, 2025-05-07T20:32:16.3408773Z compiled: bool, 2025-05-07T20:32:16.3408848Z ) -> None: 2025-05-07T20:32:16.3408938Z torch.manual_seed(2025) 2025-05-07T20:32:16.3409010Z 2025-05-07T20:32:16.3409177Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3409248Z 2025-05-07T20:32:16.3409335Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3409458Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3409545Z x = x_sign * x_clamp 2025-05-07T20:32:16.3409622Z x0 = x[:, :D] 2025-05-07T20:32:16.3409697Z x1 = x[:, D:] 2025-05-07T20:32:16.3409775Z 2025-05-07T20:32:16.3409855Z if contiguous: 2025-05-07T20:32:16.3409942Z x0 = x0.contiguous() 2025-05-07T20:32:16.3410029Z x1 = x1.contiguous() 2025-05-07T20:32:16.3410099Z 2025-05-07T20:32:16.3410185Z if scale_ub is not None: 2025-05-07T20:32:16.3410291Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3410421Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3410493Z ) 2025-05-07T20:32:16.3410569Z else: 2025-05-07T20:32:16.3410744Z scale_ub_tensor = None 2025-05-07T20:32:16.3410816Z 2025-05-07T20:32:16.3410943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3411028Z op = silu_mul_quant 2025-05-07T20:32:16.3411113Z if compiled: 2025-05-07T20:32:16.3411209Z op = torch.compile(op) 2025-05-07T20:32:16.3411308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3411380Z 2025-05-07T20:32:16.3411549Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3411554Z 2025-05-07T20:32:16.3411645Z moe/activation_test.py:117: 2025-05-07T20:32:16.3411775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3411870Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3411968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3412330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3412418Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3412906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3413048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3413395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3413620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3413955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3414048Z kernel = self.compile( 2025-05-07T20:32:16.3414419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3414587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3414712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3414723Z 2025-05-07T20:32:16.3414922Z self = 2025-05-07T20:32:16.3415680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3416179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59320d60>} 2025-05-07T20:32:16.3416906Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3417092Z context = 2025-05-07T20:32:16.3417097Z 2025-05-07T20:32:16.3417260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3417516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3417619Z module_map=module_map) 2025-05-07T20:32:16.3417777Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3417875Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3417954Z E ^ 2025-05-07T20:32:16.3418298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3418307Z 2025-05-07T20:32:16.3418708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3418712Z 2025-05-07T20:32:16.3418811Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3419031Z self=, 2025-05-07T20:32:16.3419188Z T=2048, 2025-05-07T20:32:16.3419263Z D=5120, 2025-05-07T20:32:16.3419344Z scale_ub=1200.0, 2025-05-07T20:32:16.3419430Z contiguous=False, 2025-05-07T20:32:16.3419508Z compiled=True, 2025-05-07T20:32:16.3419583Z ) 2025-05-07T20:32:16.3419794Z self = 2025-05-07T20:32:16.3419969Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3420048Z 2025-05-07T20:32:16.3420124Z @given( 2025-05-07T20:32:16.3420238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3420335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3420446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3420560Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3420672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3420743Z ) 2025-05-07T20:32:16.3420985Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3421080Z def test_silu_mul_quant( 2025-05-07T20:32:16.3421153Z self, 2025-05-07T20:32:16.3421231Z T: int, 2025-05-07T20:32:16.3421303Z D: int, 2025-05-07T20:32:16.3421398Z scale_ub: Optional[float], 2025-05-07T20:32:16.3421486Z contiguous: bool, 2025-05-07T20:32:16.3421568Z compiled: bool, 2025-05-07T20:32:16.3421641Z ) -> None: 2025-05-07T20:32:16.3421741Z torch.manual_seed(2025) 2025-05-07T20:32:16.3421810Z 2025-05-07T20:32:16.3421975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3422045Z 2025-05-07T20:32:16.3422132Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3422252Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3422340Z x = x_sign * x_clamp 2025-05-07T20:32:16.3422414Z x0 = x[:, :D] 2025-05-07T20:32:16.3422493Z x1 = x[:, D:] 2025-05-07T20:32:16.3422559Z 2025-05-07T20:32:16.3422641Z if contiguous: 2025-05-07T20:32:16.3422730Z x0 = x0.contiguous() 2025-05-07T20:32:16.3422818Z x1 = x1.contiguous() 2025-05-07T20:32:16.3422886Z 2025-05-07T20:32:16.3422975Z if scale_ub is not None: 2025-05-07T20:32:16.3423075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3423204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3423284Z ) 2025-05-07T20:32:16.3423356Z else: 2025-05-07T20:32:16.3423447Z scale_ub_tensor = None 2025-05-07T20:32:16.3423519Z 2025-05-07T20:32:16.3423644Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3423729Z op = silu_mul_quant 2025-05-07T20:32:16.3423813Z if compiled: 2025-05-07T20:32:16.3423907Z op = torch.compile(op) 2025-05-07T20:32:16.3424010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3424076Z 2025-05-07T20:32:16.3424167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3424172Z 2025-05-07T20:32:16.3424268Z moe/activation_test.py:117: 2025-05-07T20:32:16.3424393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3424491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3424593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3424949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3425045Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3425528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3425621Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3425971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3426270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3426602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3426694Z kernel = self.compile( 2025-05-07T20:32:16.3427064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3427236Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3427434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3427439Z 2025-05-07T20:32:16.3427638Z self = 2025-05-07T20:32:16.3428451Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3428948Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59031a80>} 2025-05-07T20:32:16.3429680Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3429863Z context = 2025-05-07T20:32:16.3429873Z 2025-05-07T20:32:16.3430039Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3430290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3430391Z module_map=module_map) 2025-05-07T20:32:16.3430548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3430642Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3430715Z E ^ 2025-05-07T20:32:16.3431064Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3431069Z 2025-05-07T20:32:16.3431471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3431475Z 2025-05-07T20:32:16.3431575Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3431797Z self=, 2025-05-07T20:32:16.3431869Z T=4096, 2025-05-07T20:32:16.3431942Z D=5120, 2025-05-07T20:32:16.3432022Z scale_ub=1200.0, 2025-05-07T20:32:16.3432102Z contiguous=True, 2025-05-07T20:32:16.3432183Z compiled=True, 2025-05-07T20:32:16.3432251Z ) 2025-05-07T20:32:16.3432461Z self = 2025-05-07T20:32:16.3432628Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3432633Z 2025-05-07T20:32:16.3432707Z @given( 2025-05-07T20:32:16.3432823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3432918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3433026Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3433139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3433248Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3433324Z ) 2025-05-07T20:32:16.3433565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3433655Z def test_silu_mul_quant( 2025-05-07T20:32:16.3433728Z self, 2025-05-07T20:32:16.3433803Z T: int, 2025-05-07T20:32:16.3433878Z D: int, 2025-05-07T20:32:16.3433977Z scale_ub: Optional[float], 2025-05-07T20:32:16.3434063Z contiguous: bool, 2025-05-07T20:32:16.3434146Z compiled: bool, 2025-05-07T20:32:16.3434229Z ) -> None: 2025-05-07T20:32:16.3434447Z torch.manual_seed(2025) 2025-05-07T20:32:16.3434519Z 2025-05-07T20:32:16.3434691Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3434760Z 2025-05-07T20:32:16.3434845Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3434970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3435055Z x = x_sign * x_clamp 2025-05-07T20:32:16.3435132Z x0 = x[:, :D] 2025-05-07T20:32:16.3435310Z x1 = x[:, D:] 2025-05-07T20:32:16.3435379Z 2025-05-07T20:32:16.3435461Z if contiguous: 2025-05-07T20:32:16.3435557Z x0 = x0.contiguous() 2025-05-07T20:32:16.3435642Z x1 = x1.contiguous() 2025-05-07T20:32:16.3435714Z 2025-05-07T20:32:16.3435799Z if scale_ub is not None: 2025-05-07T20:32:16.3435901Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3436032Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3436106Z ) 2025-05-07T20:32:16.3436184Z else: 2025-05-07T20:32:16.3436282Z scale_ub_tensor = None 2025-05-07T20:32:16.3436354Z 2025-05-07T20:32:16.3436481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3436571Z op = silu_mul_quant 2025-05-07T20:32:16.3436653Z if compiled: 2025-05-07T20:32:16.3436747Z op = torch.compile(op) 2025-05-07T20:32:16.3436851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3436927Z 2025-05-07T20:32:16.3437015Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3437019Z 2025-05-07T20:32:16.3437113Z moe/activation_test.py:117: 2025-05-07T20:32:16.3437239Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3437340Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3437436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3437800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3437891Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3438392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3438498Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3438869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3439089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3439423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3439513Z kernel = self.compile( 2025-05-07T20:32:16.3439883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3440057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3440183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3440188Z 2025-05-07T20:32:16.3440391Z self = 2025-05-07T20:32:16.3441145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3441647Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59033420>} 2025-05-07T20:32:16.3442375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3442561Z context = 2025-05-07T20:32:16.3442647Z 2025-05-07T20:32:16.3442814Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3443069Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3443176Z module_map=module_map) 2025-05-07T20:32:16.3443332Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3443504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3443580Z E ^ 2025-05-07T20:32:16.3443924Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3443928Z 2025-05-07T20:32:16.3444330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3444339Z 2025-05-07T20:32:16.3444435Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3444655Z self=, 2025-05-07T20:32:16.3444733Z T=128, 2025-05-07T20:32:16.3444806Z D=5120, 2025-05-07T20:32:16.3444885Z scale_ub=1200.0, 2025-05-07T20:32:16.3444974Z contiguous=False, 2025-05-07T20:32:16.3445056Z compiled=True, 2025-05-07T20:32:16.3445127Z ) 2025-05-07T20:32:16.3445342Z self = 2025-05-07T20:32:16.3445506Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.3445517Z 2025-05-07T20:32:16.3445590Z @given( 2025-05-07T20:32:16.3445709Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3445804Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3445916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3446029Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3446138Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3446214Z ) 2025-05-07T20:32:16.3446455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3446544Z def test_silu_mul_quant( 2025-05-07T20:32:16.3446619Z self, 2025-05-07T20:32:16.3446693Z T: int, 2025-05-07T20:32:16.3446766Z D: int, 2025-05-07T20:32:16.3446866Z scale_ub: Optional[float], 2025-05-07T20:32:16.3446952Z contiguous: bool, 2025-05-07T20:32:16.3447040Z compiled: bool, 2025-05-07T20:32:16.3447112Z ) -> None: 2025-05-07T20:32:16.3447202Z torch.manual_seed(2025) 2025-05-07T20:32:16.3447276Z 2025-05-07T20:32:16.3447439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3447509Z 2025-05-07T20:32:16.3447599Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3447721Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3447804Z x = x_sign * x_clamp 2025-05-07T20:32:16.3447887Z x0 = x[:, :D] 2025-05-07T20:32:16.3447967Z x1 = x[:, D:] 2025-05-07T20:32:16.3448036Z 2025-05-07T20:32:16.3448117Z if contiguous: 2025-05-07T20:32:16.3448204Z x0 = x0.contiguous() 2025-05-07T20:32:16.3448290Z x1 = x1.contiguous() 2025-05-07T20:32:16.3448360Z 2025-05-07T20:32:16.3448447Z if scale_ub is not None: 2025-05-07T20:32:16.3448550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3448678Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3448757Z ) 2025-05-07T20:32:16.3448834Z else: 2025-05-07T20:32:16.3448928Z scale_ub_tensor = None 2025-05-07T20:32:16.3448999Z 2025-05-07T20:32:16.3449127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3449212Z op = silu_mul_quant 2025-05-07T20:32:16.3449291Z if compiled: 2025-05-07T20:32:16.3449391Z op = torch.compile(op) 2025-05-07T20:32:16.3449490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3449641Z 2025-05-07T20:32:16.3449737Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3449742Z 2025-05-07T20:32:16.3449834Z moe/activation_test.py:117: 2025-05-07T20:32:16.3449964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3450063Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3450158Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3450594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3450683Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3451163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3451260Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3451608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3451835Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3452165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3452254Z kernel = self.compile( 2025-05-07T20:32:16.3452627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3452801Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3452926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3452930Z 2025-05-07T20:32:16.3453181Z self = 2025-05-07T20:32:16.3453939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3454432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a59032c00>} 2025-05-07T20:32:16.3455157Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3455348Z context = 2025-05-07T20:32:16.3455353Z 2025-05-07T20:32:16.3455511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3455762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3455865Z module_map=module_map) 2025-05-07T20:32:16.3456022Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3456122Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3456197Z E ^ 2025-05-07T20:32:16.3456542Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then retried the same test body with further parameter combinations; each reached the Triton compile step and failed with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"):

Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False
Trying example: T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True
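For context, based only on the op name and the call signature visible in the test (op(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale)), the fused _fbgemm_silu_mul_quant kernel presumably computes a SiLU-gated multiply followed by row-wise FP8 quantization with an optional scale upper bound. A plain-PyTorch sketch of that presumed computation (an assumption, not FBGEMM's implementation):

import torch
import torch.nn.functional as F

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # SiLU-gated multiply, done in fp32 for accuracy.
    y = F.silu(x0.float()) * x1.float()
    # Row-wise absolute max, optionally capped by scale_ub (a 1-element tensor).
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    # Scale each row into the representable FP8 E4M3 range.
    y_scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale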
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True — same OutOfMemoryError at moe/activation_test.py:95 (torch.clamp), this time trying to allocate 112.00 MiB with 32.44 MiB free.
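Each failed request is exactly the size of one [T, 2*D] bfloat16 temporary: 16384 × 10240 × 2 bytes = 320 MiB, and 4096 × 14336 × 2 bytes = 112 MiB. So the GPU is not so much fragmented as full: tensors from earlier Hypothesis examples are still alive when the next example starts. One possible mitigation (an assumption, not the repo's actual fix) is to drop dead tensors and return cached blocks between examples:

import gc
import torch

def free_cuda_memory() -> None:
    gc.collect()              # drop Python references to dead tensors
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver

# Hypothetical hook: call this from the test's tearDown(), or wrap each
# Hypothesis example, so one example's inputs cannot starve the next.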
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3548097Z 2025-05-07T20:32:16.3548211Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.3548216Z 2025-05-07T20:32:16.3548314Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3548531Z self=, 2025-05-07T20:32:16.3548607Z T=16384, 2025-05-07T20:32:16.3548682Z D=7168, 2025-05-07T20:32:16.3548761Z scale_ub=None, 2025-05-07T20:32:16.3548844Z contiguous=False, 2025-05-07T20:32:16.3549007Z compiled=False, 2025-05-07T20:32:16.3549078Z ) 2025-05-07T20:32:16.3549288Z self = 2025-05-07T20:32:16.3549462Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.3549467Z 2025-05-07T20:32:16.3549541Z @given( 2025-05-07T20:32:16.3549652Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3549829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3549939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3550055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3550163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3550234Z ) 2025-05-07T20:32:16.3550476Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3550565Z def test_silu_mul_quant( 2025-05-07T20:32:16.3550638Z self, 2025-05-07T20:32:16.3550719Z T: int, 2025-05-07T20:32:16.3550793Z D: int, 2025-05-07T20:32:16.3550887Z scale_ub: Optional[float], 2025-05-07T20:32:16.3550975Z contiguous: bool, 2025-05-07T20:32:16.3551057Z compiled: bool, 2025-05-07T20:32:16.3551132Z ) -> None: 2025-05-07T20:32:16.3551228Z torch.manual_seed(2025) 2025-05-07T20:32:16.3551295Z 2025-05-07T20:32:16.3551459Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3553226Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3553232Z 2025-05-07T20:32:16.3553348Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3553353Z 2025-05-07T20:32:16.3553449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3553661Z self=, 2025-05-07T20:32:16.3553739Z T=2048, 2025-05-07T20:32:16.3553814Z D=7168, 2025-05-07T20:32:16.3553891Z scale_ub=1200.0, 2025-05-07T20:32:16.3553972Z contiguous=True, 2025-05-07T20:32:16.3554050Z compiled=True, 2025-05-07T20:32:16.3554119Z ) 2025-05-07T20:32:16.3554335Z self = 2025-05-07T20:32:16.3554497Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3554502Z 2025-05-07T20:32:16.3554579Z @given( 2025-05-07T20:32:16.3554691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3554793Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3554904Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3555016Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3555127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3555201Z ) 2025-05-07T20:32:16.3555437Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3555536Z def test_silu_mul_quant( 2025-05-07T20:32:16.3555611Z self, 2025-05-07T20:32:16.3555686Z T: int, 2025-05-07T20:32:16.3555764Z D: int, 2025-05-07T20:32:16.3555858Z scale_ub: Optional[float], 2025-05-07T20:32:16.3555943Z contiguous: bool, 2025-05-07T20:32:16.3556025Z compiled: bool, 2025-05-07T20:32:16.3556100Z ) -> None: 2025-05-07T20:32:16.3556192Z torch.manual_seed(2025) 2025-05-07T20:32:16.3556264Z 2025-05-07T20:32:16.3556511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3556584Z 2025-05-07T20:32:16.3556679Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3556798Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3558542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3558646Z 2025-05-07T20:32:16.3558761Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.3558765Z 2025-05-07T20:32:16.3558866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3559087Z self=, 2025-05-07T20:32:16.3559158Z T=2048, 2025-05-07T20:32:16.3559572Z D=7168, 2025-05-07T20:32:16.3559668Z scale_ub=None, 2025-05-07T20:32:16.3559748Z contiguous=True, 2025-05-07T20:32:16.3559832Z compiled=False, 2025-05-07T20:32:16.3559904Z ) 2025-05-07T20:32:16.3560113Z self = 2025-05-07T20:32:16.3560290Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3560295Z 2025-05-07T20:32:16.3560370Z @given( 2025-05-07T20:32:16.3560489Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3560586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3560694Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3560811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3560919Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3560997Z ) 2025-05-07T20:32:16.3561243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3561330Z def test_silu_mul_quant( 2025-05-07T20:32:16.3561403Z self, 2025-05-07T20:32:16.3561481Z T: int, 2025-05-07T20:32:16.3561555Z D: int, 2025-05-07T20:32:16.3561647Z scale_ub: Optional[float], 2025-05-07T20:32:16.3561735Z contiguous: bool, 2025-05-07T20:32:16.3561819Z compiled: bool, 2025-05-07T20:32:16.3561897Z ) -> None: 2025-05-07T20:32:16.3561986Z torch.manual_seed(2025) 2025-05-07T20:32:16.3562058Z 2025-05-07T20:32:16.3562225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3562294Z 2025-05-07T20:32:16.3562383Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.3564121Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3564131Z 2025-05-07T20:32:16.3564245Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.3564249Z 2025-05-07T20:32:16.3564348Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3564565Z self=, 2025-05-07T20:32:16.3564638Z T=1, 2025-05-07T20:32:16.3564719Z D=7168, 2025-05-07T20:32:16.3564800Z scale_ub=1200.0, 2025-05-07T20:32:16.3564886Z contiguous=True, 2025-05-07T20:32:16.3564968Z compiled=False, 2025-05-07T20:32:16.3565037Z ) 2025-05-07T20:32:16.3565393Z self = 2025-05-07T20:32:16.3565553Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3565558Z 2025-05-07T20:32:16.3565632Z @given( 2025-05-07T20:32:16.3565749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3565844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3565954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3566181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3566293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3566368Z ) 2025-05-07T20:32:16.3566608Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3566699Z def test_silu_mul_quant( 2025-05-07T20:32:16.3566778Z self, 2025-05-07T20:32:16.3566852Z T: int, 2025-05-07T20:32:16.3566925Z D: int, 2025-05-07T20:32:16.3567020Z scale_ub: Optional[float], 2025-05-07T20:32:16.3567113Z contiguous: bool, 2025-05-07T20:32:16.3567196Z compiled: bool, 2025-05-07T20:32:16.3567275Z ) -> None: 2025-05-07T20:32:16.3567365Z torch.manual_seed(2025) 2025-05-07T20:32:16.3567437Z 2025-05-07T20:32:16.3567601Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3567670Z 2025-05-07T20:32:16.3567763Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3567889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3567975Z x = x_sign * x_clamp 2025-05-07T20:32:16.3568057Z x0 = x[:, :D] 2025-05-07T20:32:16.3568133Z x1 = x[:, D:] 2025-05-07T20:32:16.3568202Z 2025-05-07T20:32:16.3568290Z if contiguous: 2025-05-07T20:32:16.3568391Z x0 = x0.contiguous() 2025-05-07T20:32:16.3568489Z x1 = x1.contiguous() 2025-05-07T20:32:16.3568574Z 2025-05-07T20:32:16.3568677Z if scale_ub is not None: 2025-05-07T20:32:16.3568786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3568918Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3568991Z ) 2025-05-07T20:32:16.3569068Z else: 2025-05-07T20:32:16.3569158Z scale_ub_tensor = None 2025-05-07T20:32:16.3569226Z 2025-05-07T20:32:16.3569356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3569442Z op = silu_mul_quant 2025-05-07T20:32:16.3569531Z if compiled: 2025-05-07T20:32:16.3569631Z op = torch.compile(op) 2025-05-07T20:32:16.3569732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3569804Z 2025-05-07T20:32:16.3569897Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3569901Z 2025-05-07T20:32:16.3569997Z moe/activation_test.py:117: 2025-05-07T20:32:16.3570121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3570220Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3570321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3570814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3570908Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3571258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3571487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3571819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3571913Z kernel = self.compile( 2025-05-07T20:32:16.3572286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3572457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3572666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3572671Z 2025-05-07T20:32:16.3572872Z self = 2025-05-07T20:32:16.3573680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3574251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a421c1440>} 2025-05-07T20:32:16.3574982Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3575172Z context = 2025-05-07T20:32:16.3575182Z 2025-05-07T20:32:16.3575343Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3575603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3575708Z module_map=module_map) 2025-05-07T20:32:16.3575865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3575969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3576043Z E ^ 2025-05-07T20:32:16.3576386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3576391Z 2025-05-07T20:32:16.3576796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3576800Z 2025-05-07T20:32:16.3576899Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3577121Z self=, 2025-05-07T20:32:16.3577194Z T=128, 2025-05-07T20:32:16.3577268Z D=5120, 2025-05-07T20:32:16.3577348Z scale_ub=None, 2025-05-07T20:32:16.3577431Z contiguous=True, 2025-05-07T20:32:16.3577513Z compiled=False, 2025-05-07T20:32:16.3577584Z ) 2025-05-07T20:32:16.3577795Z self = 2025-05-07T20:32:16.3577961Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3577969Z 2025-05-07T20:32:16.3578042Z @given( 2025-05-07T20:32:16.3578155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3578255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3578364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3578478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3578593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3578666Z ) 2025-05-07T20:32:16.3578909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3579003Z def test_silu_mul_quant( 2025-05-07T20:32:16.3579080Z self, 2025-05-07T20:32:16.3579159Z T: int, 2025-05-07T20:32:16.3579231Z D: int, 2025-05-07T20:32:16.3579326Z scale_ub: Optional[float], 2025-05-07T20:32:16.3579416Z contiguous: bool, 2025-05-07T20:32:16.3579499Z compiled: bool, 2025-05-07T20:32:16.3579581Z ) -> None: 2025-05-07T20:32:16.3579677Z torch.manual_seed(2025) 2025-05-07T20:32:16.3579749Z 2025-05-07T20:32:16.3579910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3579984Z 2025-05-07T20:32:16.3580073Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3580196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3580286Z x = x_sign * x_clamp 2025-05-07T20:32:16.3580360Z x0 = x[:, :D] 2025-05-07T20:32:16.3580521Z x1 = x[:, D:] 2025-05-07T20:32:16.3580597Z 2025-05-07T20:32:16.3580677Z if contiguous: 2025-05-07T20:32:16.3580768Z x0 = x0.contiguous() 2025-05-07T20:32:16.3580853Z x1 = x1.contiguous() 2025-05-07T20:32:16.3580924Z 2025-05-07T20:32:16.3581013Z if scale_ub is not None: 2025-05-07T20:32:16.3581116Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3581248Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3581401Z ) 2025-05-07T20:32:16.3581476Z else: 2025-05-07T20:32:16.3581568Z scale_ub_tensor = None 2025-05-07T20:32:16.3581638Z 2025-05-07T20:32:16.3581763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3581848Z op = silu_mul_quant 2025-05-07T20:32:16.3581933Z if compiled: 2025-05-07T20:32:16.3582028Z op = torch.compile(op) 2025-05-07T20:32:16.3582135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3582209Z 2025-05-07T20:32:16.3582295Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3582300Z 2025-05-07T20:32:16.3582397Z moe/activation_test.py:117: 2025-05-07T20:32:16.3582522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3582618Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3582719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3583207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3583308Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3583656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3583874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3584207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3588190Z kernel = self.compile( 2025-05-07T20:32:16.3588617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3588789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3588917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3588928Z 2025-05-07T20:32:16.3589128Z self = 2025-05-07T20:32:16.3589888Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3590380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a421c2520>} 2025-05-07T20:32:16.3591111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3591299Z context = 2025-05-07T20:32:16.3591304Z 2025-05-07T20:32:16.3591466Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3591727Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3591832Z module_map=module_map) 2025-05-07T20:32:16.3591990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3592089Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3592162Z E ^ 2025-05-07T20:32:16.3592512Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3592644Z 2025-05-07T20:32:16.3593051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3593055Z 2025-05-07T20:32:16.3593152Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3593373Z self=, 2025-05-07T20:32:16.3593446Z T=128, 2025-05-07T20:32:16.3593603Z D=7168, 2025-05-07T20:32:16.3593684Z scale_ub=None, 2025-05-07T20:32:16.3593763Z contiguous=True, 2025-05-07T20:32:16.3593850Z compiled=False, 2025-05-07T20:32:16.3593921Z ) 2025-05-07T20:32:16.3594133Z self = 2025-05-07T20:32:16.3594300Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3594304Z 2025-05-07T20:32:16.3594377Z @given( 2025-05-07T20:32:16.3594492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3594597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3594708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3594818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3594934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3595005Z ) 2025-05-07T20:32:16.3595248Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3595343Z def test_silu_mul_quant( 2025-05-07T20:32:16.3595415Z self, 2025-05-07T20:32:16.3595491Z T: int, 2025-05-07T20:32:16.3595562Z D: int, 2025-05-07T20:32:16.3595658Z scale_ub: Optional[float], 2025-05-07T20:32:16.3595748Z contiguous: bool, 2025-05-07T20:32:16.3595829Z compiled: bool, 2025-05-07T20:32:16.3595901Z ) -> None: 2025-05-07T20:32:16.3595997Z torch.manual_seed(2025) 2025-05-07T20:32:16.3596065Z 2025-05-07T20:32:16.3596232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3596305Z 2025-05-07T20:32:16.3596394Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3596519Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3596606Z x = x_sign * x_clamp 2025-05-07T20:32:16.3596681Z x0 = x[:, :D] 2025-05-07T20:32:16.3596763Z x1 = x[:, D:] 2025-05-07T20:32:16.3596833Z 2025-05-07T20:32:16.3596912Z if contiguous: 2025-05-07T20:32:16.3597013Z x0 = x0.contiguous() 2025-05-07T20:32:16.3597099Z x1 = x1.contiguous() 2025-05-07T20:32:16.3597167Z 2025-05-07T20:32:16.3597258Z if scale_ub is not None: 2025-05-07T20:32:16.3597359Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3597491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3597564Z ) 2025-05-07T20:32:16.3597637Z else: 2025-05-07T20:32:16.3597734Z scale_ub_tensor = None 2025-05-07T20:32:16.3597805Z 2025-05-07T20:32:16.3597933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3598023Z op = silu_mul_quant 2025-05-07T20:32:16.3598103Z if compiled: 2025-05-07T20:32:16.3598205Z op = torch.compile(op) 2025-05-07T20:32:16.3598332Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3598415Z 2025-05-07T20:32:16.3598514Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3598523Z 2025-05-07T20:32:16.3598618Z moe/activation_test.py:117: 2025-05-07T20:32:16.3598743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3598842Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3598942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3599428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3599527Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3599960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3600179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3600516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3600605Z kernel = self.compile( 2025-05-07T20:32:16.3600982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3601226Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3601348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3601352Z 2025-05-07T20:32:16.3601554Z self = 2025-05-07T20:32:16.3602318Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3602813Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a421c3420>} 2025-05-07T20:32:16.3603541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3603731Z context = 2025-05-07T20:32:16.3603736Z 2025-05-07T20:32:16.3603900Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3604151Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3604255Z module_map=module_map) 2025-05-07T20:32:16.3604415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3604510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3604585Z E ^ 2025-05-07T20:32:16.3604935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3604940Z 2025-05-07T20:32:16.3605347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3605356Z 2025-05-07T20:32:16.3605453Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3605670Z self=, 2025-05-07T20:32:16.3605748Z T=2048, 2025-05-07T20:32:16.3605821Z D=7168, 2025-05-07T20:32:16.3605899Z scale_ub=1200.0, 2025-05-07T20:32:16.3605982Z contiguous=True, 2025-05-07T20:32:16.3606062Z compiled=False, 2025-05-07T20:32:16.3606131Z ) 2025-05-07T20:32:16.3606350Z self = 2025-05-07T20:32:16.3606519Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3606523Z 2025-05-07T20:32:16.3606599Z @given( 2025-05-07T20:32:16.3606714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3606808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3606923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3607044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3607154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3607228Z ) 2025-05-07T20:32:16.3607469Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3607559Z def test_silu_mul_quant( 2025-05-07T20:32:16.3607632Z self, 2025-05-07T20:32:16.3607706Z T: int, 2025-05-07T20:32:16.3607786Z D: int, 2025-05-07T20:32:16.3607962Z scale_ub: Optional[float], 2025-05-07T20:32:16.3608049Z contiguous: bool, 2025-05-07T20:32:16.3608133Z compiled: bool, 2025-05-07T20:32:16.3608206Z ) -> None: 2025-05-07T20:32:16.3608311Z torch.manual_seed(2025) 2025-05-07T20:32:16.3608394Z 2025-05-07T20:32:16.3608582Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3610338Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3610446Z 2025-05-07T20:32:16.3610564Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3610568Z 2025-05-07T20:32:16.3610664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3610885Z self=, 2025-05-07T20:32:16.3610957Z T=1, 2025-05-07T20:32:16.3611034Z D=5120, 2025-05-07T20:32:16.3611113Z scale_ub=1200.0, 2025-05-07T20:32:16.3611192Z contiguous=True, 2025-05-07T20:32:16.3611282Z compiled=False, 2025-05-07T20:32:16.3611348Z ) 2025-05-07T20:32:16.3611559Z self = 2025-05-07T20:32:16.3611720Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3611725Z 2025-05-07T20:32:16.3611798Z @given( 2025-05-07T20:32:16.3611910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3612006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3612115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3612231Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3612340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3612413Z ) 2025-05-07T20:32:16.3612654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3612744Z def test_silu_mul_quant( 2025-05-07T20:32:16.3612816Z self, 2025-05-07T20:32:16.3612891Z T: int, 2025-05-07T20:32:16.3612971Z D: int, 2025-05-07T20:32:16.3613130Z scale_ub: Optional[float], 2025-05-07T20:32:16.3613219Z contiguous: bool, 2025-05-07T20:32:16.3613301Z compiled: bool, 2025-05-07T20:32:16.3613375Z ) -> None: 2025-05-07T20:32:16.3613468Z torch.manual_seed(2025) 2025-05-07T20:32:16.3613538Z 2025-05-07T20:32:16.3613700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3613769Z 2025-05-07T20:32:16.3613856Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3613988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3614073Z x = x_sign * x_clamp 2025-05-07T20:32:16.3614150Z x0 = x[:, :D] 2025-05-07T20:32:16.3614228Z x1 = x[:, D:] 2025-05-07T20:32:16.3614299Z 2025-05-07T20:32:16.3614378Z if contiguous: 2025-05-07T20:32:16.3614469Z x0 = x0.contiguous() 2025-05-07T20:32:16.3614556Z x1 = x1.contiguous() 2025-05-07T20:32:16.3614627Z 2025-05-07T20:32:16.3614719Z if scale_ub is not None: 2025-05-07T20:32:16.3614820Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3614949Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3615025Z ) 2025-05-07T20:32:16.3615097Z else: 2025-05-07T20:32:16.3615189Z scale_ub_tensor = None 2025-05-07T20:32:16.3615256Z 2025-05-07T20:32:16.3615381Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3615473Z op = silu_mul_quant 2025-05-07T20:32:16.3615638Z if compiled: 2025-05-07T20:32:16.3615735Z op = torch.compile(op) 2025-05-07T20:32:16.3615842Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3615912Z 2025-05-07T20:32:16.3615998Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3616003Z 2025-05-07T20:32:16.3616099Z moe/activation_test.py:117: 2025-05-07T20:32:16.3616225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3616399Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3616495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3616982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3617080Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3617430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3617653Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3617991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3618082Z kernel = self.compile( 2025-05-07T20:32:16.3618508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3618681Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3618804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3618808Z 2025-05-07T20:32:16.3619009Z self = 2025-05-07T20:32:16.3619773Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3620265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a420149a0>} 2025-05-07T20:32:16.3620993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3621182Z context = 2025-05-07T20:32:16.3621190Z 2025-05-07T20:32:16.3621353Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3621605Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3621714Z module_map=module_map) 2025-05-07T20:32:16.3621871Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3621970Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3622046Z E ^ 2025-05-07T20:32:16.3622388Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3622392Z 2025-05-07T20:32:16.3622796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3622801Z 2025-05-07T20:32:16.3622904Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3623120Z self=, 2025-05-07T20:32:16.3623194Z T=2048, 2025-05-07T20:32:16.3623266Z D=5120, 2025-05-07T20:32:16.3623343Z scale_ub=None, 2025-05-07T20:32:16.3623435Z contiguous=True, 2025-05-07T20:32:16.3623514Z compiled=False, 2025-05-07T20:32:16.3623584Z ) 2025-05-07T20:32:16.3623797Z self = 2025-05-07T20:32:16.3624070Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3624075Z 2025-05-07T20:32:16.3624153Z @given( 2025-05-07T20:32:16.3624268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3624363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3624476Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3624587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3624768Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3624843Z ) 2025-05-07T20:32:16.3625081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3625169Z def test_silu_mul_quant( 2025-05-07T20:32:16.3625247Z self, 2025-05-07T20:32:16.3625320Z T: int, 2025-05-07T20:32:16.3625397Z D: int, 2025-05-07T20:32:16.3625490Z scale_ub: Optional[float], 2025-05-07T20:32:16.3625577Z contiguous: bool, 2025-05-07T20:32:16.3625660Z compiled: bool, 2025-05-07T20:32:16.3625738Z ) -> None: 2025-05-07T20:32:16.3625829Z torch.manual_seed(2025) 2025-05-07T20:32:16.3625904Z 2025-05-07T20:32:16.3626067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3626138Z 2025-05-07T20:32:16.3626228Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.3627974Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3627987Z 2025-05-07T20:32:16.3628108Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.3628112Z 2025-05-07T20:32:16.3628210Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3628427Z self=, 2025-05-07T20:32:16.3628499Z T=16384, 2025-05-07T20:32:16.3628571Z D=5120, 2025-05-07T20:32:16.3628648Z scale_ub=None, 2025-05-07T20:32:16.3628729Z contiguous=True, 2025-05-07T20:32:16.3628807Z compiled=False, 2025-05-07T20:32:16.3628882Z ) 2025-05-07T20:32:16.3629092Z self = 2025-05-07T20:32:16.3629262Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.3629266Z 2025-05-07T20:32:16.3629343Z @given( 2025-05-07T20:32:16.3629455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3629554Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3629664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3629779Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3629891Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3629960Z ) 2025-05-07T20:32:16.3630199Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3630289Z def test_silu_mul_quant( 2025-05-07T20:32:16.3630360Z self, 2025-05-07T20:32:16.3630430Z T: int, 2025-05-07T20:32:16.3630513Z D: int, 2025-05-07T20:32:16.3630607Z scale_ub: Optional[float], 2025-05-07T20:32:16.3630693Z contiguous: bool, 2025-05-07T20:32:16.3630779Z compiled: bool, 2025-05-07T20:32:16.3630849Z ) -> None: 2025-05-07T20:32:16.3630944Z torch.manual_seed(2025) 2025-05-07T20:32:16.3631011Z 2025-05-07T20:32:16.3631171Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3633081Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The next seven Hypothesis examples fail identically, each hitting torch.OutOfMemoryError at the test's first allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) (moe/activation_test.py:92); only the parameters and the requested size vary:

Trying example: T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False -> tried to allocate 80.00 MiB
Trying example: T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True -> tried to allocate 112.00 MiB
Trying example: T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False -> tried to allocate 40.00 MiB
Trying example: T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True -> tried to allocate 448.00 MiB
Trying example: T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False -> tried to allocate 112.00 MiB

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)

> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3674154Z 2025-05-07T20:32:16.3674271Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3674275Z 2025-05-07T20:32:16.3674370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3674591Z self=, 2025-05-07T20:32:16.3674670Z T=16384, 2025-05-07T20:32:16.3674739Z D=7168, 2025-05-07T20:32:16.3674819Z scale_ub=1200.0, 2025-05-07T20:32:16.3674904Z contiguous=True, 2025-05-07T20:32:16.3674984Z compiled=False, 2025-05-07T20:32:16.3675059Z ) 2025-05-07T20:32:16.3675270Z self = 2025-05-07T20:32:16.3675444Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3675448Z 2025-05-07T20:32:16.3675525Z @given( 2025-05-07T20:32:16.3675637Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3675734Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3675844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3675957Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3676068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3676140Z ) 2025-05-07T20:32:16.3676384Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3676474Z def test_silu_mul_quant( 2025-05-07T20:32:16.3676548Z self, 2025-05-07T20:32:16.3676620Z T: int, 2025-05-07T20:32:16.3676696Z D: int, 2025-05-07T20:32:16.3676792Z scale_ub: Optional[float], 2025-05-07T20:32:16.3676876Z contiguous: bool, 2025-05-07T20:32:16.3676965Z compiled: bool, 2025-05-07T20:32:16.3677041Z ) -> None: 2025-05-07T20:32:16.3677130Z torch.manual_seed(2025) 2025-05-07T20:32:16.3677201Z 2025-05-07T20:32:16.3677362Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3679108Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
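A pattern worth noting before the next failures: each request here is small (40-448 MiB) while PyTorch already holds roughly 21.7 GiB, so the OOMs likely stem from memory accumulated across earlier Hypothesis examples rather than from any single tensor. One mitigation sketch, using a hypothetical cleanup helper that the test class could call between examples (e.g. from setUp/tearDown):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop unreachable Python objects first so their CUDA storages are
        # freed, then return cached blocks to the driver so the next
        # Hypothesis example starts from a clean allocator state.
        gc.collect()
        torch.cuda.empty_cache()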
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3679114Z 2025-05-07T20:32:16.3679226Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3679231Z 2025-05-07T20:32:16.3679336Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3679552Z self=, 2025-05-07T20:32:16.3679628Z T=128, 2025-05-07T20:32:16.3679708Z D=5120, 2025-05-07T20:32:16.3679786Z scale_ub=1200.0, 2025-05-07T20:32:16.3679868Z contiguous=False, 2025-05-07T20:32:16.3679950Z compiled=False, 2025-05-07T20:32:16.3680021Z ) 2025-05-07T20:32:16.3680231Z self = 2025-05-07T20:32:16.3680488Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.3680493Z 2025-05-07T20:32:16.3680568Z @given( 2025-05-07T20:32:16.3680687Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3680783Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3680894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3681009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3681222Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3681295Z ) 2025-05-07T20:32:16.3681537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3681629Z def test_silu_mul_quant( 2025-05-07T20:32:16.3681704Z self, 2025-05-07T20:32:16.3681782Z T: int, 2025-05-07T20:32:16.3681854Z D: int, 2025-05-07T20:32:16.3681948Z scale_ub: Optional[float], 2025-05-07T20:32:16.3682039Z contiguous: bool, 2025-05-07T20:32:16.3682126Z compiled: bool, 2025-05-07T20:32:16.3682205Z ) -> None: 2025-05-07T20:32:16.3682295Z torch.manual_seed(2025) 2025-05-07T20:32:16.3682363Z 2025-05-07T20:32:16.3682529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3682600Z 2025-05-07T20:32:16.3682688Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3682815Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3682907Z x = x_sign * x_clamp 2025-05-07T20:32:16.3682986Z x0 = x[:, :D] 2025-05-07T20:32:16.3683068Z x1 = x[:, D:] 2025-05-07T20:32:16.3683140Z 2025-05-07T20:32:16.3683221Z if contiguous: 2025-05-07T20:32:16.3683313Z x0 = x0.contiguous() 2025-05-07T20:32:16.3683400Z x1 = x1.contiguous() 2025-05-07T20:32:16.3683473Z 2025-05-07T20:32:16.3683560Z if scale_ub is not None: 2025-05-07T20:32:16.3683663Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3683799Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3683869Z ) 2025-05-07T20:32:16.3683943Z else: 2025-05-07T20:32:16.3684039Z scale_ub_tensor = None 2025-05-07T20:32:16.3684108Z 2025-05-07T20:32:16.3684233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3684323Z op = silu_mul_quant 2025-05-07T20:32:16.3684404Z if compiled: 2025-05-07T20:32:16.3684506Z op = torch.compile(op) 2025-05-07T20:32:16.3684610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3684683Z 2025-05-07T20:32:16.3684774Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3684778Z 2025-05-07T20:32:16.3684872Z moe/activation_test.py:117: 2025-05-07T20:32:16.3684996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3685094Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3685191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3685686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3685785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3686137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3686355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3686693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3686785Z kernel = self.compile( 2025-05-07T20:32:16.3687162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3687335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3687459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3687468Z 2025-05-07T20:32:16.3687749Z self = 2025-05-07T20:32:16.3688570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3689066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a41fc7560>} 2025-05-07T20:32:16.3689876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3690064Z context = 2025-05-07T20:32:16.3690069Z 2025-05-07T20:32:16.3690232Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3690486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3690591Z module_map=module_map) 2025-05-07T20:32:16.3690751Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3690845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3690923Z E ^ 2025-05-07T20:32:16.3691272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3691276Z 2025-05-07T20:32:16.3691683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3691688Z 2025-05-07T20:32:16.3691788Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3692004Z self=, 2025-05-07T20:32:16.3692083Z T=2048, 2025-05-07T20:32:16.3692161Z D=7168, 2025-05-07T20:32:16.3692244Z scale_ub=None, 2025-05-07T20:32:16.3692327Z contiguous=False, 2025-05-07T20:32:16.3692408Z compiled=False, 2025-05-07T20:32:16.3692479Z ) 2025-05-07T20:32:16.3692692Z self = 2025-05-07T20:32:16.3692859Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.3692864Z 2025-05-07T20:32:16.3692949Z @given( 2025-05-07T20:32:16.3693116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3693213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3693326Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3693439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3693550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3693617Z ) 2025-05-07T20:32:16.3693857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3693954Z def test_silu_mul_quant( 2025-05-07T20:32:16.3694029Z self, 2025-05-07T20:32:16.3694102Z T: int, 2025-05-07T20:32:16.3694176Z D: int, 2025-05-07T20:32:16.3694272Z scale_ub: Optional[float], 2025-05-07T20:32:16.3694358Z contiguous: bool, 2025-05-07T20:32:16.3694448Z compiled: bool, 2025-05-07T20:32:16.3694521Z ) -> None: 2025-05-07T20:32:16.3694611Z torch.manual_seed(2025) 2025-05-07T20:32:16.3694687Z 2025-05-07T20:32:16.3694852Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3696688Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
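The CompilationError recurring throughout this run is architectural rather than flaky: Triton's fp8e4nv (FP8 E4M3) only compiles for SM 8.9+ GPUs (Ada/Hopper), and the fallback list ('fp8e4b15', 'fp8e5') in the error is what Triton offers on older parts such as the SM 8.6 A10G, consistent with the ~22 GiB capacity reported above. A hedged dtype-selection sketch, assuming the surrounding kernels can also consume E5M2:

    import torch

    def preferred_fp8_dtype() -> torch.dtype:
        # torch.float8_e4m3fn maps to Triton's fp8e4nv, which needs SM 8.9+;
        # earlier GPUs fall back to E5M2 (Triton's 'fp8e5').
        # Assumption: callers accept either dtype.
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2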
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3696695Z 2025-05-07T20:32:16.3696808Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3696812Z 2025-05-07T20:32:16.3696914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3697131Z self=, 2025-05-07T20:32:16.3697281Z T=128, 2025-05-07T20:32:16.3697362Z D=7168, 2025-05-07T20:32:16.3697444Z scale_ub=1200.0, 2025-05-07T20:32:16.3697524Z contiguous=True, 2025-05-07T20:32:16.3697606Z compiled=True, 2025-05-07T20:32:16.3697677Z ) 2025-05-07T20:32:16.3697888Z self = 2025-05-07T20:32:16.3698053Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3698058Z 2025-05-07T20:32:16.3698131Z @given( 2025-05-07T20:32:16.3698256Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3698353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3698463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3698578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3698686Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3698757Z ) 2025-05-07T20:32:16.3699004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3699095Z def test_silu_mul_quant( 2025-05-07T20:32:16.3699170Z self, 2025-05-07T20:32:16.3699247Z T: int, 2025-05-07T20:32:16.3699321Z D: int, 2025-05-07T20:32:16.3699419Z scale_ub: Optional[float], 2025-05-07T20:32:16.3699505Z contiguous: bool, 2025-05-07T20:32:16.3699586Z compiled: bool, 2025-05-07T20:32:16.3699662Z ) -> None: 2025-05-07T20:32:16.3699750Z torch.manual_seed(2025) 2025-05-07T20:32:16.3699823Z 2025-05-07T20:32:16.3699990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3700061Z 2025-05-07T20:32:16.3700149Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3700275Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3700359Z x = x_sign * x_clamp 2025-05-07T20:32:16.3700437Z x0 = x[:, :D] 2025-05-07T20:32:16.3700517Z x1 = x[:, D:] 2025-05-07T20:32:16.3700592Z 2025-05-07T20:32:16.3700672Z if contiguous: 2025-05-07T20:32:16.3700763Z x0 = x0.contiguous() 2025-05-07T20:32:16.3700848Z x1 = x1.contiguous() 2025-05-07T20:32:16.3700922Z 2025-05-07T20:32:16.3701009Z if scale_ub is not None: 2025-05-07T20:32:16.3701110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3701242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3701314Z ) 2025-05-07T20:32:16.3701388Z else: 2025-05-07T20:32:16.3701488Z scale_ub_tensor = None 2025-05-07T20:32:16.3701560Z 2025-05-07T20:32:16.3701685Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3701773Z op = silu_mul_quant 2025-05-07T20:32:16.3701853Z if compiled: 2025-05-07T20:32:16.3701948Z op = torch.compile(op) 2025-05-07T20:32:16.3702055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3702131Z 2025-05-07T20:32:16.3702220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3702225Z 2025-05-07T20:32:16.3702318Z moe/activation_test.py:117: 2025-05-07T20:32:16.3702443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3702544Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3702641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3703003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3703177Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3703662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3703761Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3704109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3704327Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3704734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3704826Z kernel = self.compile( 2025-05-07T20:32:16.3705200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3705372Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3705500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3705504Z 2025-05-07T20:32:16.3705708Z self = 2025-05-07T20:32:16.3706470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3706975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f6a41e30e00>} 2025-05-07T20:32:16.3707705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3707895Z context = 2025-05-07T20:32:16.3707899Z 2025-05-07T20:32:16.3708066Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3708338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3708458Z module_map=module_map) 2025-05-07T20:32:16.3708637Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3708731Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3708814Z E ^ 2025-05-07T20:32:16.3709163Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3709167Z 2025-05-07T20:32:16.3709568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3709576Z 2025-05-07T20:32:16.3709676Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3714091Z self=, 2025-05-07T20:32:16.3714179Z T=128, 2025-05-07T20:32:16.3714261Z D=7168, 2025-05-07T20:32:16.3714342Z scale_ub=1200.0, 2025-05-07T20:32:16.3714422Z contiguous=True, 2025-05-07T20:32:16.3714508Z compiled=False, 2025-05-07T20:32:16.3714579Z ) 2025-05-07T20:32:16.3714797Z self = 2025-05-07T20:32:16.3714970Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:16.3714983Z 2025-05-07T20:32:16.3715059Z @given( 2025-05-07T20:32:16.3715177Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3715280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3715394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3715512Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3715623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3715698Z ) 2025-05-07T20:32:16.3716051Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3716149Z def test_silu_mul_quant( 2025-05-07T20:32:16.3716225Z self, 2025-05-07T20:32:16.3716304Z T: int, 2025-05-07T20:32:16.3716380Z D: int, 2025-05-07T20:32:16.3716479Z scale_ub: Optional[float], 2025-05-07T20:32:16.3716574Z contiguous: bool, 2025-05-07T20:32:16.3716658Z compiled: bool, 2025-05-07T20:32:16.3716814Z ) -> None: 2025-05-07T20:32:16.3716914Z torch.manual_seed(2025) 2025-05-07T20:32:16.3716984Z 2025-05-07T20:32:16.3717156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3717228Z 2025-05-07T20:32:16.3717318Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3717449Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3719270Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3719281Z 2025-05-07T20:32:16.3719401Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:16.3719406Z 2025-05-07T20:32:16.3719506Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3719727Z self=, 2025-05-07T20:32:16.3719804Z T=128, 2025-05-07T20:32:16.3719880Z D=5120, 2025-05-07T20:32:16.3719959Z scale_ub=1200.0, 2025-05-07T20:32:16.3720047Z contiguous=True, 2025-05-07T20:32:16.3720126Z compiled=True, 2025-05-07T20:32:16.3720197Z ) 2025-05-07T20:32:16.3720415Z self = 2025-05-07T20:32:16.3720580Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:16.3720584Z 2025-05-07T20:32:16.3720661Z @given( 2025-05-07T20:32:16.3720774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3720869Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3720981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3721098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3721208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3721286Z ) 2025-05-07T20:32:16.3721526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3721621Z def test_silu_mul_quant( 2025-05-07T20:32:16.3721693Z self, 2025-05-07T20:32:16.3721769Z T: int, 2025-05-07T20:32:16.3721847Z D: int, 2025-05-07T20:32:16.3721947Z scale_ub: Optional[float], 2025-05-07T20:32:16.3722035Z contiguous: bool, 2025-05-07T20:32:16.3722121Z compiled: bool, 2025-05-07T20:32:16.3722197Z ) -> None: 2025-05-07T20:32:16.3722290Z torch.manual_seed(2025) 2025-05-07T20:32:16.3722360Z 2025-05-07T20:32:16.3722524Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3722595Z 2025-05-07T20:32:16.3722687Z > x_sign = torch.sign(x) 2025-05-07T20:32:16.3724509Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3724516Z 2025-05-07T20:32:16.3724635Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:16.3724640Z 2025-05-07T20:32:16.3724739Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3724958Z self=, 2025-05-07T20:32:16.3725030Z T=128, 2025-05-07T20:32:16.3725103Z D=7168, 2025-05-07T20:32:16.3725262Z scale_ub=None, 2025-05-07T20:32:16.3725345Z contiguous=True, 2025-05-07T20:32:16.3725425Z compiled=True, 2025-05-07T20:32:16.3725497Z ) 2025-05-07T20:32:16.3725708Z self = 2025-05-07T20:32:16.3725868Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.3725878Z 2025-05-07T20:32:16.3725951Z @given( 2025-05-07T20:32:16.3726065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3726171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3726282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3726398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3726511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3726585Z ) 2025-05-07T20:32:16.3726822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3726914Z def test_silu_mul_quant( 2025-05-07T20:32:16.3726994Z self, 2025-05-07T20:32:16.3727069Z T: int, 2025-05-07T20:32:16.3727146Z D: int, 2025-05-07T20:32:16.3727240Z scale_ub: Optional[float], 2025-05-07T20:32:16.3727329Z contiguous: bool, 2025-05-07T20:32:16.3727413Z compiled: bool, 2025-05-07T20:32:16.3727486Z ) -> None: 2025-05-07T20:32:16.3727579Z torch.manual_seed(2025) 2025-05-07T20:32:16.3727649Z 2025-05-07T20:32:16.3727811Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3729606Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:16.3729616Z 2025-05-07T20:32:16.3729729Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:16.3729863Z =============================== warnings summary =============================== 2025-05-07T20:32:16.3730167Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:16.3730467Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:16.3730762Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:16.3731625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:16.3731858Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:16.3731863Z 2025-05-07T20:32:16.3732036Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:16.3733425Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:16.3733613Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:16.3733618Z 2025-05-07T20:32:16.3733825Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:16.3733991Z ================== 1 failed, 1 passed, 13 warnings in 20.32s =================== 2025-05-07T20:32:18.0898110Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:18.1516326Z 2025-05-07T20:32:18.1516700Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:18.1517056Z 2025-05-07T20:32:18.1517063Z 2025-05-07T20:32:18.1538826Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:20.3100239Z ============================= test session starts ============================== 2025-05-07T20:32:20.3100881Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:20.3101397Z cachedir: .pytest_cache 2025-05-07T20:32:20.3101960Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:20.3102691Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:20.3103094Z plugins: hypothesis-6.131.14 2025-05-07T20:32:21.9279186Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:22.0387083Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:22.0387491Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:22.0387716Z 2025-05-07T20:32:24.1160932Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.1162165Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:24.1163493Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.1164934Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.1165904Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1167191Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.1168557Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.1169532Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1170743Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.1172453Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.1173608Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1174868Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.1176275Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:24.1177484Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.1178666Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:24.1179486Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1180493Z W0507 
20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:24.1181675Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:24.1182453Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:24.1183648Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.1184910Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.1186006Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:24.1187043Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:24.1188195Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.1189529Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.1190581Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.1191479Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.1192209Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:24.1193205Z W0507 20:32:24.113000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.1320853Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.1321904Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:24.1323218Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.1324738Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.1325694Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1326985Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.1328346Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.1329317Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1330527Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.1331884Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.1332937Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1334297Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.1335528Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:24.1336725Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.1337923Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:24.1338734Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.1339743Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:24.1340750Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:24.1341531Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:32:24.1342714Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.1344073Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.1345175Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:24.1346208Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:24.1347451Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.1348778Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.1349831Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.1350735Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.1351506Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:24.1352518Z W0507 20:32:24.130000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5513789Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5514427Z self=, 2025-05-07T20:32:24.5514840Z T=1, 2025-05-07T20:32:24.5515041Z D=5120, 2025-05-07T20:32:24.5515247Z scale_ub=None, 2025-05-07T20:32:24.5515470Z contiguous=True, 2025-05-07T20:32:24.5515701Z compiled=True, 2025-05-07T20:32:24.5515905Z ) 2025-05-07T20:32:24.5516231Z self = 2025-05-07T20:32:24.5516719Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:24.5516978Z 2025-05-07T20:32:24.5517064Z @given( 2025-05-07T20:32:24.5517308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:24.5517646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:24.5517956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:24.5518289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:24.5518612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:24.5518899Z ) 2025-05-07T20:32:24.5519251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:24.5519692Z def test_silu_mul_quant( 2025-05-07T20:32:24.5519933Z self, 2025-05-07T20:32:24.5520129Z T: int, 2025-05-07T20:32:24.5520349Z D: int, 2025-05-07T20:32:24.5520575Z scale_ub: Optional[float], 2025-05-07T20:32:24.5520853Z contiguous: bool, 2025-05-07T20:32:24.5521089Z compiled: bool, 2025-05-07T20:32:24.5521325Z ) -> None: 2025-05-07T20:32:24.5521546Z torch.manual_seed(2025) 2025-05-07T20:32:24.5521792Z 2025-05-07T20:32:24.5522074Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:24.5522417Z 2025-05-07T20:32:24.5522609Z x_sign = torch.sign(x) 2025-05-07T20:32:24.5522908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:24.5523224Z x = x_sign * x_clamp 2025-05-07T20:32:24.5523462Z x0 = x[:, :D] 2025-05-07T20:32:24.5523683Z x1 = x[:, D:] 2025-05-07T20:32:24.5523894Z 2025-05-07T20:32:24.5524080Z if contiguous: 2025-05-07T20:32:24.5524583Z x0 = x0.contiguous() 2025-05-07T20:32:24.5524851Z x1 = x1.contiguous() 2025-05-07T20:32:24.5525088Z 2025-05-07T20:32:24.5525284Z if scale_ub is not None: 2025-05-07T20:32:24.5525557Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:24.5525894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:24.5526198Z ) 2025-05-07T20:32:24.5526392Z else: 2025-05-07T20:32:24.5526764Z scale_ub_tensor = None 2025-05-07T20:32:24.5527017Z 2025-05-07T20:32:24.5527253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5527567Z op = silu_mul_quant 2025-05-07T20:32:24.5527814Z if compiled: 2025-05-07T20:32:24.5528065Z op = torch.compile(op) 2025-05-07T20:32:24.5528364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:24.5528635Z 2025-05-07T20:32:24.5528832Z y_fp8, y_scale = fn() 2025-05-07T20:32:24.5529125Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:24.5529413Z 2025-05-07T20:32:24.5529654Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:24.5529990Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:24.5530280Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:24.5530594Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:24.5530953Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5531271Z 2025-05-07T20:32:24.5531477Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:24.5531674Z 2025-05-07T20:32:24.5531779Z moe/activation_test.py:126: 2025-05-07T20:32:24.5532078Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5532408Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:24.5532736Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:24.5533592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:24.5534347Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:24.5534891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:24.5535571Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:24.5536268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:24.5536980Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:24.5537705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:24.5538341Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:24.5538945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:24.5539454Z fn() 2025-05-07T20:32:24.5539967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:24.5540550Z self.fn.run( 2025-05-07T20:32:24.5541024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:24.5541548Z kernel = self.compile( 2025-05-07T20:32:24.5542097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:24.5542751Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.5543144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:24.5543378Z 2025-05-07T20:32:24.5543588Z self = 2025-05-07T20:32:24.5544757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:24.5546136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1098d4c20>} 2025-05-07T20:32:24.5547461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:24.5548551Z context = 2025-05-07T20:32:24.5548841Z 2025-05-07T20:32:24.5549005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:24.5549525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.5549999Z module_map=module_map) 2025-05-07T20:32:24.5550361Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.5550726Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:24.5551001Z E ^ 2025-05-07T20:32:24.5551489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.5551969Z 2025-05-07T20:32:24.5552386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.5552896Z 2025-05-07T20:32:24.5553002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:24.5553420Z self=, 2025-05-07T20:32:24.5553815Z T=2048, 2025-05-07T20:32:24.5554008Z D=5120, 2025-05-07T20:32:24.5554210Z scale_ub=1200.0, 2025-05-07T20:32:24.5554432Z contiguous=True, 2025-05-07T20:32:24.5554658Z compiled=False, 2025-05-07T20:32:24.5554869Z ) 2025-05-07T20:32:25.0007414Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.0009550Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:25.0012173Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.0013727Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.0014701Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0015990Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.0017356Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.0018333Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0019536Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.0021265Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.0022359Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0023798Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.0025026Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:25.0026233Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.0027418Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:25.0028232Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0029245Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:25.0030248Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:25.0031022Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:25.0032273Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.0033530Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.0034638Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:25.0035674Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:25.0036828Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.0038163Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.0039208Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.0040104Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.0040830Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:25.0041831Z W0507 20:32:24.996000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.0901010Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.0902084Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:25.0903404Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.0904968Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.0905926Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0907231Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.0908591Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.0909563Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0910785Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.0912132Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.0913193Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0914460Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.0915695Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:25.0916896Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.0918084Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:25.0918901Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.0919910Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:25.0920917Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:25.0921717Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:25.0923020Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.0924292Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.0925391Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:25.0926500Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:25.0927655Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.0928996Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.0930042Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.0930936Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.0931669Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:25.0932726Z W0507 20:32:25.087000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.5499311Z self = 2025-05-07T20:32:25.5499862Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:25.5500144Z 2025-05-07T20:32:25.5500236Z @given( 2025-05-07T20:32:25.5500497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.5500828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.5501149Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.5501484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.5501820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.5502119Z ) 2025-05-07T20:32:25.5502485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.5502934Z def test_silu_mul_quant( 2025-05-07T20:32:25.5503191Z self, 2025-05-07T20:32:25.5503399Z T: int, 2025-05-07T20:32:25.5503601Z D: int, 2025-05-07T20:32:25.5503835Z scale_ub: Optional[float], 2025-05-07T20:32:25.5504114Z contiguous: bool, 2025-05-07T20:32:25.5504359Z compiled: bool, 2025-05-07T20:32:25.5504593Z ) -> None: 2025-05-07T20:32:25.5504823Z torch.manual_seed(2025) 2025-05-07T20:32:25.5505067Z 2025-05-07T20:32:25.5505353Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.5505702Z 2025-05-07T20:32:25.5505900Z x_sign = torch.sign(x) 2025-05-07T20:32:25.5506206Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.5506523Z x = x_sign * x_clamp 2025-05-07T20:32:25.5506770Z x0 = x[:, :D] 2025-05-07T20:32:25.5506997Z x1 = x[:, D:] 2025-05-07T20:32:25.5507228Z 2025-05-07T20:32:25.5507422Z if contiguous: 2025-05-07T20:32:25.5507669Z x0 = x0.contiguous() 2025-05-07T20:32:25.5507937Z x1 = x1.contiguous() 2025-05-07T20:32:25.5508177Z 2025-05-07T20:32:25.5508386Z if scale_ub is not None: 2025-05-07T20:32:25.5508678Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.5509022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.5509358Z ) 2025-05-07T20:32:25.5509562Z else: 2025-05-07T20:32:25.5510000Z scale_ub_tensor = None 2025-05-07T20:32:25.5517402Z 2025-05-07T20:32:25.5517660Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.5517978Z op = silu_mul_quant 2025-05-07T20:32:25.5518232Z if compiled: 2025-05-07T20:32:25.5518485Z op = torch.compile(op) 2025-05-07T20:32:25.5518779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.5519247Z 2025-05-07T20:32:25.5519448Z > y_fp8, y_scale = fn() 2025-05-07T20:32:25.5519612Z 2025-05-07T20:32:25.5519715Z moe/activation_test.py:117: 2025-05-07T20:32:25.5520022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5520359Z moe/activation_test.py:115: in fn 2025-05-07T20:32:25.5520639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.5521329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:25.5522074Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:25.5522607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.5523279Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.5523946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.5524481Z kernel = self.compile( 2025-05-07T20:32:25.5525025Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.5525671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.5526068Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5526294Z 2025-05-07T20:32:25.5526505Z self = 2025-05-07T20:32:25.5527576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.5528933Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb109990180>} 2025-05-07T20:32:25.5530261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.5531276Z context = 2025-05-07T20:32:25.5531561Z 2025-05-07T20:32:25.5531732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.5532241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.5532708Z module_map=module_map) 2025-05-07T20:32:25.5533163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.5533513Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:25.5533772Z E ^ 2025-05-07T20:32:25.5534234Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.5534681Z 2025-05-07T20:32:25.5535097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.5535600Z 2025-05-07T20:32:25.5535708Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.5536109Z self=, 2025-05-07T20:32:25.5536510Z T=2048, 2025-05-07T20:32:25.5536703Z D=5120, 2025-05-07T20:32:25.5536889Z scale_ub=1200.0, 2025-05-07T20:32:25.5537111Z contiguous=True, 2025-05-07T20:32:25.5537419Z compiled=True, 2025-05-07T20:32:25.5537621Z ) 2025-05-07T20:32:25.5537938Z self = 2025-05-07T20:32:25.5538433Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:25.5538698Z 2025-05-07T20:32:25.5538777Z @given( 2025-05-07T20:32:25.5539010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:25.5539394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:25.5539705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:25.5540026Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:25.5540353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:25.5540640Z ) 2025-05-07T20:32:25.5540981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:25.5541423Z def test_silu_mul_quant( 2025-05-07T20:32:25.5541669Z self, 2025-05-07T20:32:25.5541864Z T: int, 2025-05-07T20:32:25.5542067Z D: int, 2025-05-07T20:32:25.5542289Z scale_ub: Optional[float], 2025-05-07T20:32:25.5542553Z contiguous: bool, 2025-05-07T20:32:25.5542792Z compiled: bool, 2025-05-07T20:32:25.5543014Z ) -> None: 2025-05-07T20:32:25.5543225Z torch.manual_seed(2025) 2025-05-07T20:32:25.5543474Z 2025-05-07T20:32:25.5543750Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:25.5544101Z 2025-05-07T20:32:25.5544290Z x_sign = torch.sign(x) 2025-05-07T20:32:25.5544587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:25.5544903Z x = x_sign * x_clamp 2025-05-07T20:32:25.5545142Z x0 = x[:, :D] 
2025-05-07T20:32:25.5545369Z x1 = x[:, D:] 2025-05-07T20:32:25.5545585Z 2025-05-07T20:32:25.5545766Z if contiguous: 2025-05-07T20:32:25.5546005Z x0 = x0.contiguous() 2025-05-07T20:32:25.5546263Z x1 = x1.contiguous() 2025-05-07T20:32:25.5546506Z 2025-05-07T20:32:25.5546699Z if scale_ub is not None: 2025-05-07T20:32:25.5546979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:25.5547308Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:25.5547619Z ) 2025-05-07T20:32:25.5547814Z else: 2025-05-07T20:32:25.5548019Z scale_ub_tensor = None 2025-05-07T20:32:25.5548270Z 2025-05-07T20:32:25.5548510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.5548817Z op = silu_mul_quant 2025-05-07T20:32:25.5549072Z if compiled: 2025-05-07T20:32:25.5549321Z op = torch.compile(op) 2025-05-07T20:32:25.5549620Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:25.5549890Z 2025-05-07T20:32:25.5550089Z y_fp8, y_scale = fn() 2025-05-07T20:32:25.5550381Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:25.5550660Z 2025-05-07T20:32:25.5550905Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:25.5551242Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:25.5551526Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:25.5551844Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:25.5552208Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.5552514Z 2025-05-07T20:32:25.5552720Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:25.5552917Z 2025-05-07T20:32:25.5553015Z moe/activation_test.py:126: 2025-05-07T20:32:25.5553318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5553650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:25.5553987Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:25.5554772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:25.5555595Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:25.5556143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:25.5556820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:25.5557499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:25.5558279Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:25.5559000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:25.5560021Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:25.5560624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:25.5561126Z fn() 2025-05-07T20:32:25.5561639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:25.5562220Z self.fn.run( 2025-05-07T20:32:25.5562730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:25.5563264Z kernel = self.compile( 2025-05-07T20:32:25.5563802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:25.5564456Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.5564845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:25.5565076Z 2025-05-07T20:32:25.5565281Z self = 2025-05-07T20:32:25.5566358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:25.5567713Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852d260>} 2025-05-07T20:32:25.5569035Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:25.5570059Z context = 2025-05-07T20:32:25.5570350Z 2025-05-07T20:32:25.5570514Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:25.5571034Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.5571493Z module_map=module_map) 2025-05-07T20:32:25.5571862Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:25.5572220Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:25.5572482Z E ^ 2025-05-07T20:32:25.5572949Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.5573480Z 2025-05-07T20:32:25.5573891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.5574406Z 2025-05-07T20:32:25.5574516Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:25.5574923Z self=, 2025-05-07T20:32:25.5575323Z T=16384, 2025-05-07T20:32:25.5575525Z D=7168, 2025-05-07T20:32:25.5575721Z scale_ub=1200.0, 2025-05-07T20:32:25.5575940Z contiguous=False, 2025-05-07T20:32:25.5576168Z compiled=False, 2025-05-07T20:32:25.5576372Z ) 2025-05-07T20:32:25.8020796Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:25.8021919Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:25.8023263Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:25.8024928Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:25.8025890Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8027181Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:25.8028548Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:25.8029528Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8030739Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:25.8032100Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:25.8033155Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8034416Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:25.8035651Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:25.8036859Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:25.8038048Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:25.8038863Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:25.8039868Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:25.8040882Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:25.8041669Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:25.8042937Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:25.8044209Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:25.8045311Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:25.8046409Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:25.8047556Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:25.8048900Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:25.8049951Z 
W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.8050848Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.8051584Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:32:25.8052579Z W0507 20:32:25.798000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.8637799Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:25.8639113Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last):
2025-05-07T20:32:25.8640421Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:25.8641833Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:25.8642849Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8644134Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:25.8645488Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.8646455Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8647664Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:25.8649017Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.8650273Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8651531Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:25.8652897Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     generator.visit(fn.parse())
2025-05-07T20:32:25.8654172Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:25.8655372Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ret = super().visit(node)
2025-05-07T20:32:25.8656185Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:32:25.8657194Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:32:25.8658193Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return visitor(node)
2025-05-07T20:32:25.8658978Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]            ^^^^^^^^^^^^^
2025-05-07T20:32:25.8660329Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:25.8661591Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:25.8662689Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:32:25.8663711Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     self.visit(item)
2025-05-07T20:32:25.8664875Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:25.8666202Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:25.8667256Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.8668152Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.8668887Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:32:25.8669889Z W0507 20:32:25.860000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:26.3637827Z self =
2025-05-07T20:32:26.3638346Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:26.3638629Z
2025-05-07T20:32:26.3638716Z @given(
2025-05-07T20:32:26.3638980Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:26.3639466Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:26.3639792Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:26.3640129Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:26.3640449Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:26.3640733Z )
2025-05-07T20:32:26.3641085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:26.3641639Z def test_silu_mul_quant(
2025-05-07T20:32:26.3641882Z     self,
2025-05-07T20:32:26.3642111Z     T: int,
2025-05-07T20:32:26.3642323Z     D: int,
2025-05-07T20:32:26.3642545Z     scale_ub: Optional[float],
2025-05-07T20:32:26.3642820Z     contiguous: bool,
2025-05-07T20:32:26.3643063Z     compiled: bool,
2025-05-07T20:32:26.3643280Z ) -> None:
2025-05-07T20:32:26.3643499Z     torch.manual_seed(2025)
2025-05-07T20:32:26.3643743Z
2025-05-07T20:32:26.3644021Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:26.3644362Z
2025-05-07T20:32:26.3644557Z     x_sign = torch.sign(x)
2025-05-07T20:32:26.3644841Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:26.3645150Z     x = x_sign * x_clamp
2025-05-07T20:32:26.3645400Z     x0 = x[:, :D]
2025-05-07T20:32:26.3645611Z     x1 = x[:, D:]
2025-05-07T20:32:26.3645830Z
2025-05-07T20:32:26.3646031Z     if contiguous:
2025-05-07T20:32:26.3646265Z         x0 = x0.contiguous()
2025-05-07T20:32:26.3646524Z         x1 = x1.contiguous()
2025-05-07T20:32:26.3646769Z
2025-05-07T20:32:26.3646959Z     if scale_ub is not None:
2025-05-07T20:32:26.3647240Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:26.3647582Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:26.3647895Z         )
2025-05-07T20:32:26.3648085Z     else:
2025-05-07T20:32:26.3648300Z         scale_ub_tensor = None
2025-05-07T20:32:26.3648555Z
2025-05-07T20:32:26.3648796Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:26.3649113Z         op = silu_mul_quant
2025-05-07T20:32:26.3649365Z         if compiled:
2025-05-07T20:32:26.3649610Z             op = torch.compile(op)
2025-05-07T20:32:26.3649907Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:26.3650180Z
2025-05-07T20:32:26.3650371Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:26.3650550Z
2025-05-07T20:32:26.3650648Z moe/activation_test.py:117:
2025-05-07T20:32:26.3650944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3651264Z moe/activation_test.py:115: in fn
2025-05-07T20:32:26.3651545Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:26.3652232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:26.3652913Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:26.3653534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:26.3654210Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:26.3654868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:26.3655393Z     kernel = self.compile(
2025-05-07T20:32:26.3655929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:26.3656575Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:26.3656976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3657200Z
2025-05-07T20:32:26.3657406Z self =
2025-05-07T20:32:26.3658561Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:26.3660067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852f060>}
2025-05-07T20:32:26.3661390Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:26.3662566Z context =
2025-05-07T20:32:26.3662847Z
2025-05-07T20:32:26.3663014Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:26.3663531Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:26.3663993Z                            module_map=module_map)
2025-05-07T20:32:26.3664356Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:26.3664712Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:26.3664974Z E       ^
2025-05-07T20:32:26.3665440Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:26.3665887Z
2025-05-07T20:32:26.3666299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:26.3666815Z
2025-05-07T20:32:26.3666921Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:26.3667337Z     self=,
2025-05-07T20:32:26.3667743Z     T=1,
2025-05-07T20:32:26.3667929Z     D=7168,
2025-05-07T20:32:26.3668127Z     scale_ub=None,
2025-05-07T20:32:26.3668343Z     contiguous=True,
2025-05-07T20:32:26.3668563Z     compiled=True,
2025-05-07T20:32:26.3668779Z )
2025-05-07T20:32:26.3669105Z self =
2025-05-07T20:32:26.3669612Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:26.3669868Z
2025-05-07T20:32:26.3681567Z     y_fp8, y_scale = fn()
2025-05-07T20:32:26.3681847Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:26.3682139Z
2025-05-07T20:32:26.3682388Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:26.3682733Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:26.3683024Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:26.3683345Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:26.3683708Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:26.3684015Z
2025-05-07T20:32:26.3684227Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:26.3684421Z
2025-05-07T20:32:26.3684529Z moe/activation_test.py:126:
2025-05-07T20:32:26.3684829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3685167Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:26.3685499Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:26.3686290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:26.3687035Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:26.3687591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:26.3688268Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:26.3688952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:26.3689662Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:26.3690390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:26.3691025Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:26.3691615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:26.3692142Z     fn()
2025-05-07T20:32:26.3692809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:26.3693455Z     self.fn.run(
2025-05-07T20:32:26.3693920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:26.3694461Z     kernel = self.compile(
2025-05-07T20:32:26.3695003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:26.3695654Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:26.3696058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:26.3696288Z
2025-05-07T20:32:26.3696495Z self =
2025-05-07T20:32:26.3697564Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:26.3699053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb103820fe0>}
2025-05-07T20:32:26.3700378Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:26.3701466Z context =
2025-05-07T20:32:26.3701752Z
2025-05-07T20:32:26.3701924Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:26.3702439Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:26.3702901Z                            module_map=module_map)
2025-05-07T20:32:26.3703372Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:26.3703796Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:26.3711251Z E       ^
2025-05-07T20:32:26.3711738Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:26.3712184Z
2025-05-07T20:32:26.3712609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:26.3713117Z
2025-05-07T20:32:26.3713225Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:26.3713644Z     self=,
2025-05-07T20:32:26.3714041Z     T=4096,
2025-05-07T20:32:26.3714235Z     D=5120,
2025-05-07T20:32:26.3714426Z     scale_ub=None,
2025-05-07T20:32:26.3714643Z     contiguous=False,
2025-05-07T20:32:26.3714870Z     compiled=False,
2025-05-07T20:32:26.3715066Z )
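Note on the failures above: fp8e4nv is Triton's name for the float8_e4m3fn format, which the NVIDIA backend only lowers on GPUs with compute capability 8.9 or newer (Ada/Hopper); the A10G in a linux.g5.4xlarge.nvidia.gpu runner is SM 8.6, so every kernel that casts to tl.float8e4nv fails inside make_ir no matter which Hypothesis parameters are drawn. A minimal guard one might add to such a test is sketched below; require_fp8e4nv is a hypothetical helper, not part of the FBGEMM test suite.

import pytest
import torch

def require_fp8e4nv() -> None:
    # Hypothetical helper: skip FP8-e4m3 tests on GPUs older than SM 8.9,
    # where Triton cannot lower tl.float8e4nv and raises the ValueError above.
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        pytest.skip(f"fp8e4nv (float8_e4m3fn) needs SM 8.9+, got SM {major}.{minor}")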
2025-05-07T20:32:27.6411544Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.6411964Z     self=,
2025-05-07T20:32:27.6412364Z     T=4096,
2025-05-07T20:32:27.6412564Z     D=7168,
2025-05-07T20:32:27.6412766Z     scale_ub=None,
2025-05-07T20:32:27.6413071Z     contiguous=False,
2025-05-07T20:32:27.6413339Z     compiled=False,
2025-05-07T20:32:27.6413556Z )
2025-05-07T20:32:27.6443224Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.6443645Z     self=,
2025-05-07T20:32:27.6444058Z     T=128,
2025-05-07T20:32:27.6444261Z     D=7168,
2025-05-07T20:32:27.6444468Z     scale_ub=None,
2025-05-07T20:32:27.6444693Z     contiguous=False,
2025-05-07T20:32:27.6444925Z     compiled=True,
2025-05-07T20:32:27.6445133Z )
2025-05-07T20:32:27.7046344Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.7046753Z     self=,
2025-05-07T20:32:27.7047149Z     T=128,
2025-05-07T20:32:27.7047346Z     D=7168,
2025-05-07T20:32:27.7047545Z     scale_ub=None,
2025-05-07T20:32:27.7047764Z     contiguous=False,
2025-05-07T20:32:27.7048004Z     compiled=False,
2025-05-07T20:32:27.7048217Z )
2025-05-07T20:32:27.9031598Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.9032007Z     self=,
2025-05-07T20:32:27.9032406Z     T=4096,
2025-05-07T20:32:27.9032595Z     D=5120,
2025-05-07T20:32:27.9032810Z     scale_ub=1200.0,
2025-05-07T20:32:27.9033057Z     contiguous=True,
2025-05-07T20:32:27.9033282Z     compiled=False,
2025-05-07T20:32:27.9033482Z )
2025-05-07T20:32:27.9062419Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:27.9062830Z     self=,
2025-05-07T20:32:27.9063229Z     T=1,
2025-05-07T20:32:27.9063412Z     D=5120,
2025-05-07T20:32:27.9063606Z     scale_ub=None,
2025-05-07T20:32:27.9063818Z     contiguous=True,
2025-05-07T20:32:27.9064045Z     compiled=True,
2025-05-07T20:32:27.9064246Z )
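Each example above fails before any kernel executes: the error is raised while Triton builds IR for the fp8 cast. A standalone sketch that would reproduce the same CompilationError on a pre-SM-8.9 GPU, assuming a CUDA device and a recent Triton (hypothetical reproducer, not code from this repository):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # The cast to tl.float8e4nv below is what trips make_ir on SM < 8.9 with
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

n = 1024
x = torch.randn(n, device="cuda", dtype=torch.bfloat16)
y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
# On an A10G (SM 8.6) this launch raises triton.compiler.errors.CompilationError;
# on SM 8.9+ (e.g. L4/H100) it compiles and runs.
_cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)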
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:28.1403462Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:28.1404344Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:28.1405354Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:28.1406365Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:28.1407162Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:28.1408366Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:28.1409644Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:28.1410746Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:28.1411783Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:28.1412955Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:28.1414407Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:28.1415553Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.1416664Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.1417557Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:28.1418809Z W0507 20:32:28.135000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1038b2840>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
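This failure, and every retry below, has a single root cause: Triton's fp8e4nv is float8_e4m3fn, which Triton's NVIDIA backend only supports on compute capability 8.9 or newer (Ada/Hopper). The A10G in a linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A minimal sketch of a capability guard that would skip these tests on such runners (the helper and class body are illustrative, not FBGEMM's actual test scaffolding):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv / float8_e4m3fn needs SM >= 8.9; the A10G here reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        def test_silu_mul_quant(self) -> None:
            ...  # fp8 kernels may assume e4m3 support past the guard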
Trying example: test_silu_mul_quant(
self=,
T=2048,
D=5120,
scale_ub=None,
contiguous=True,
compiled=True,
)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
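For orientation, ref_fn in the failing test is SiLU gating followed by row-wise fp8 quantization: y = x0 * sigmoid(x0) * x1, then one scale per row. A pure-PyTorch sketch of what triton_quantize_fp8_row appears to compute, judging from the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None] (the e4m3fn target and the scale_ub clamping are assumptions, not FBGEMM's exact kernel semantics):

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)  # one max per row
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)   # cap the scale if given
        y_scale = row_max / fp8_max                      # dequantization scale per row
        y_fp8 = (y / y_scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale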
Trying example: test_silu_mul_quant(
self=,
T=128,
D=5120,
scale_ub=None,
contiguous=True,
compiled=True,
)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
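The repeated "Trying example" blocks are Hypothesis' Verbosity.verbose trial log; every drawn example dies on the same CompilationError, so the run cannot surface a smaller, distinct counterexample. When debugging locally, a pinned example replays one failing draw deterministically (a sketch; the test body is elided):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=5120)  # the smallest failing trial seen in this log
    @settings(deadline=None, max_examples=5)
    def test_replay(T: int, D: int) -> None:
        ...  # body as in test_silu_mul_quant above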
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.7439034Z 2025-05-07T20:32:29.7439445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.7439956Z 2025-05-07T20:32:29.7440071Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.7440484Z self=, 2025-05-07T20:32:29.7440889Z T=4096, 2025-05-07T20:32:29.7441077Z D=5120, 2025-05-07T20:32:29.7441279Z scale_ub=None, 2025-05-07T20:32:29.7441500Z contiguous=True, 2025-05-07T20:32:29.7441721Z compiled=True, 2025-05-07T20:32:29.7441927Z ) 2025-05-07T20:32:29.9813918Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:29.9814987Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:29.9816304Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:29.9817708Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:29.9818664Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9819949Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:29.9821304Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.9822264Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9823471Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:29.9824821Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.9825875Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9827602Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:29.9828835Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:29.9830035Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:29.9831338Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:29.9832156Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:29.9833158Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:29.9834157Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:29.9834937Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:29.9836130Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:29.9837391Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:29.9838490Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:29.9839508Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:29.9840667Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:29.9842001Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:29.9843047Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.9843985Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.9844720Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:29.9845725Z W0507 20:32:29.978000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.0513005Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:30.0514115Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:30.0515425Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:30.0516972Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:30.0517941Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0519221Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:30.0520692Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.0521662Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0523032Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:30.0524390Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.0525444Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0526710Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:30.0527939Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:30.0529141Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:30.0530332Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:30.0531151Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:30.0532176Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:30.0533256Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return visitor(node) 2025-05-07T20:32:30.0534090Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:30.0535283Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:30.0536556Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:30.0537666Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:30.0538694Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:30.0539939Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:30.0541281Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:30.0542404Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.0543314Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.0544090Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:30.0545105Z W0507 20:32:30.048000 98555 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
self =
T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102b0ad40>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
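Note where this throws: inside src.make_ir, i.e. during AST-to-TTIR lowering, before the kernel ever launches, so the autotuner and torch.compile are bystanders. Assuming standard Triton APIs, a standalone kernel along these lines should reproduce the same CompilationError on this hardware (a sketch, not code from this repo):

    # Sketch: any kernel that materializes tl.float8e4nv should hit the
    # same error on SM < 8.9.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The cast below forces the fp8e4nv type into the generated TTIR.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    # On SM 8.6 (A10G) this raises the CompilationError wrapping the
    # ValueError above; on SM 8.9+ it compiles and runs.
    _cast_to_fp8[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)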
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:32:30.422000 98555 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

[... test body and traceback identical to the display above elided; this example fails the same way at `y_fp8_ref, y_scale_ref = ref_fn()` -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row ...]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
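The recompile-limit warning above is a side effect of the test's contiguous parameter: x0 = x[:, :D] is a view whose row stride stays 2 * D = 10240, while x0.contiguous() has row stride D = 5120, and torch.compile guards on strides, so alternating examples keep invalidating the guard until config.recompile_limit (8) trips and dynamo falls back. A small sketch of the stride arithmetic, plus one possible mitigation (an assumption, not taken from this log):

    # Stride arithmetic behind the guard failure (runnable on CPU):
    import torch

    D = 5120
    x = torch.randn([4, 2 * D], dtype=torch.bfloat16)
    x0_view = x[:, :D]                # a view: row stride stays 2 * D
    x0_cont = x[:, :D].contiguous()   # a copy: row stride becomes D
    assert x0_view.stride(0) == 10240   # "actual 10240" in the warning
    assert x0_cont.stride(0) == 5120    # "expected 5120" in the warning

    # Possible mitigation sketch: raise the limit named by convert_frame.py
    # before compiling, at the cost of more recompiles.
    import torch._dynamo
    torch._dynamo.config.recompile_limit = 64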
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

[... identical test body and traceback elided; fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) -> torch/_dynamo/eval_frame.py:678 -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant, same CompilationError ...]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)

[... identical test body and traceback elided; fails at `y_fp8_ref, y_scale_ref = ref_fn()` -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, same CompilationError ...]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)

[... identical test body and traceback elided; fails at `y_fp8, y_scale = fn()` -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant, this time with no torch._dynamo frame since compiled=False ...]

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
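For reference, ref_fn computes SiLU(x0) * x1 in fp32 and then quantizes row-wise to fp8. The eager sketch below mirrors that shape of computation under assumed semantics for the row-wise quantize (per-row max-abs scale, optional scale_ub clamp); the actual triton_quantize_fp8_row implementation in fbgemm_gpu is not shown in this log, so treat the details as illustrative only:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs, clamped away from zero to avoid divide-by-zero.
        row_max = y.abs().amax(dim=1).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = row_max / FP8_MAX                     # per-row dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0 = torch.randn(128, 5120).float()
    x1 = torch.randn(128, 5120).float()
    y = x0 * torch.sigmoid(x0) * x1                   # SiLU(x0) * x1, as in ref_fn
    y_fp8, y_scale = quantize_fp8_row_ref(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]  # dequant, as in the test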
2025-05-07T20:32:30.8790096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.8790782Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.8791306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.8791972Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.8792622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.8793146Z kernel = self.compile( 2025-05-07T20:32:30.8793791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.8794595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.8795081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.8795360Z 2025-05-07T20:32:30.8795641Z self = 2025-05-07T20:32:30.8796784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.8798121Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201d1c0>} 2025-05-07T20:32:30.8799432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.8800442Z context = 2025-05-07T20:32:30.8800723Z 2025-05-07T20:32:30.8800888Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.8801458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.8801990Z module_map=module_map) 2025-05-07T20:32:30.8802353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.8802701Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.8802950Z E ^ 2025-05-07T20:32:30.8803447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.8803996Z 2025-05-07T20:32:30.8804558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.8805185Z 2025-05-07T20:32:30.8805317Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.8805778Z self=, 2025-05-07T20:32:30.8806171Z T=128, 2025-05-07T20:32:30.8806360Z D=7168, 2025-05-07T20:32:30.8806554Z scale_ub=1200.0, 2025-05-07T20:32:30.8806787Z contiguous=False, 2025-05-07T20:32:30.8807012Z compiled=False, 2025-05-07T20:32:30.8807222Z ) 2025-05-07T20:32:30.9939543Z self = 2025-05-07T20:32:30.9940147Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:30.9940540Z 2025-05-07T20:32:30.9940657Z @given( 2025-05-07T20:32:30.9940912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9941232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9941567Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9941909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9942243Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9942545Z ) 2025-05-07T20:32:30.9942906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9943365Z def test_silu_mul_quant( 2025-05-07T20:32:30.9943614Z self, 2025-05-07T20:32:30.9943819Z T: int, 2025-05-07T20:32:30.9944033Z D: int, 2025-05-07T20:32:30.9944260Z scale_ub: Optional[float], 2025-05-07T20:32:30.9944542Z contiguous: bool, 2025-05-07T20:32:30.9944793Z compiled: bool, 2025-05-07T20:32:30.9945027Z ) -> None: 2025-05-07T20:32:30.9945255Z torch.manual_seed(2025) 2025-05-07T20:32:30.9945512Z 2025-05-07T20:32:30.9945790Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9946149Z 2025-05-07T20:32:30.9946360Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9946653Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9946978Z x = x_sign * x_clamp 2025-05-07T20:32:30.9947230Z x0 = x[:, :D] 2025-05-07T20:32:30.9947452Z x1 = x[:, D:] 2025-05-07T20:32:30.9947673Z 2025-05-07T20:32:30.9947871Z if contiguous: 2025-05-07T20:32:30.9948115Z x0 = x0.contiguous() 2025-05-07T20:32:30.9948388Z x1 = x1.contiguous() 2025-05-07T20:32:30.9948638Z 2025-05-07T20:32:30.9948840Z if scale_ub is not None: 2025-05-07T20:32:30.9949121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9949462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9949777Z ) 2025-05-07T20:32:30.9949975Z else: 2025-05-07T20:32:30.9950199Z scale_ub_tensor = None 2025-05-07T20:32:30.9950461Z 2025-05-07T20:32:30.9950700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9951024Z op = silu_mul_quant 2025-05-07T20:32:30.9951290Z if compiled: 2025-05-07T20:32:30.9951545Z op = torch.compile(op) 2025-05-07T20:32:30.9951856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9952136Z 2025-05-07T20:32:30.9952334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.9952711Z 2025-05-07T20:32:30.9952815Z moe/activation_test.py:117: 2025-05-07T20:32:30.9953122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9953600Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.9953889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9954587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.9955276Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9955811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9956636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9957296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9957833Z kernel = self.compile( 2025-05-07T20:32:30.9958389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9959050Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9959643Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9959877Z 2025-05-07T20:32:30.9960090Z self = 2025-05-07T20:32:30.9961164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9962526Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201cd60>} 2025-05-07T20:32:30.9963868Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9964929Z context = 2025-05-07T20:32:30.9965231Z 2025-05-07T20:32:30.9965403Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9965935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9966406Z module_map=module_map) 2025-05-07T20:32:30.9966787Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9967154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9967429Z E ^ 2025-05-07T20:32:30.9967897Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.9968352Z 2025-05-07T20:32:30.9968770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.9969281Z 2025-05-07T20:32:30.9969393Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.9969821Z self=, 2025-05-07T20:32:30.9970225Z T=128, 2025-05-07T20:32:30.9970418Z D=5120, 2025-05-07T20:32:30.9970626Z scale_ub=None, 2025-05-07T20:32:30.9970842Z contiguous=False, 2025-05-07T20:32:30.9971081Z compiled=False, 2025-05-07T20:32:30.9971298Z ) 2025-05-07T20:32:30.9971620Z self = 2025-05-07T20:32:30.9972122Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:30.9972393Z 2025-05-07T20:32:30.9972479Z @given( 2025-05-07T20:32:30.9972721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9973112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9973447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9973871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9974358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9974686Z ) 2025-05-07T20:32:30.9975056Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9975508Z def test_silu_mul_quant( 2025-05-07T20:32:30.9975772Z self, 2025-05-07T20:32:30.9975996Z T: int, 2025-05-07T20:32:30.9976211Z D: int, 2025-05-07T20:32:30.9976450Z scale_ub: Optional[float], 2025-05-07T20:32:30.9976804Z contiguous: bool, 2025-05-07T20:32:30.9977061Z compiled: bool, 2025-05-07T20:32:30.9977305Z ) -> None: 2025-05-07T20:32:30.9977542Z torch.manual_seed(2025) 2025-05-07T20:32:30.9977799Z 2025-05-07T20:32:30.9978090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9978447Z 2025-05-07T20:32:30.9978656Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9978971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9979308Z x = x_sign * x_clamp 2025-05-07T20:32:30.9979575Z x0 = x[:, :D] 2025-05-07T20:32:30.9979812Z x1 = x[:, D:] 2025-05-07T20:32:30.9980045Z 2025-05-07T20:32:30.9980262Z if contiguous: 2025-05-07T20:32:30.9980515Z x0 = x0.contiguous() 2025-05-07T20:32:30.9980796Z x1 = x1.contiguous() 2025-05-07T20:32:30.9981061Z 2025-05-07T20:32:30.9981276Z if scale_ub is not None: 2025-05-07T20:32:30.9981570Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9981927Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9982249Z ) 2025-05-07T20:32:30.9982461Z else: 2025-05-07T20:32:30.9982694Z scale_ub_tensor = None 2025-05-07T20:32:30.9982957Z 2025-05-07T20:32:30.9983215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9983548Z op = silu_mul_quant 2025-05-07T20:32:30.9983816Z if compiled: 2025-05-07T20:32:30.9984127Z op = torch.compile(op) 2025-05-07T20:32:30.9984455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9984760Z 2025-05-07T20:32:30.9984968Z > y_fp8, y_scale = fn() 2025-05-07T20:32:30.9985147Z 2025-05-07T20:32:30.9985254Z moe/activation_test.py:117: 2025-05-07T20:32:30.9985568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9985907Z moe/activation_test.py:115: in fn 2025-05-07T20:32:30.9986215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9986912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:30.9987617Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9988170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9988869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9989541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9990080Z kernel = self.compile( 2025-05-07T20:32:30.9990637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9991300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9991707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9991939Z 2025-05-07T20:32:30.9992154Z self = 2025-05-07T20:32:30.9993227Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9994717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201e2a0>} 2025-05-07T20:32:30.9996058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9997076Z context = 2025-05-07T20:32:30.9997430Z 2025-05-07T20:32:30.9997600Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9998137Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9998614Z module_map=module_map) 2025-05-07T20:32:30.9998992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9999369Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9999652Z E ^ 2025-05-07T20:32:31.0000152Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.0000609Z 2025-05-07T20:32:31.0001027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.0001555Z 2025-05-07T20:32:31.0001668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.0002097Z self=, 2025-05-07T20:32:31.0002526Z T=128, 2025-05-07T20:32:31.0002726Z D=5120, 2025-05-07T20:32:31.0002943Z scale_ub=1200.0, 2025-05-07T20:32:31.0003192Z contiguous=True, 2025-05-07T20:32:31.0003431Z compiled=False, 2025-05-07T20:32:31.0003662Z ) 2025-05-07T20:32:31.1751813Z self = 2025-05-07T20:32:31.1752398Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:31.1752696Z 2025-05-07T20:32:31.1752786Z @given( 2025-05-07T20:32:31.1753051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.1753383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.1753703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.1754046Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.1754401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.1754746Z ) 2025-05-07T20:32:31.1755110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.1755572Z def test_silu_mul_quant( 2025-05-07T20:32:31.1755830Z self, 2025-05-07T20:32:31.1756032Z T: int, 2025-05-07T20:32:31.1756246Z D: int, 2025-05-07T20:32:31.1756476Z scale_ub: Optional[float], 2025-05-07T20:32:31.1756758Z contiguous: bool, 2025-05-07T20:32:31.1757012Z compiled: bool, 2025-05-07T20:32:31.1757251Z ) -> None: 2025-05-07T20:32:31.1757473Z torch.manual_seed(2025) 2025-05-07T20:32:31.1757727Z 2025-05-07T20:32:31.1758020Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.1758367Z 2025-05-07T20:32:31.1758573Z x_sign = torch.sign(x) 2025-05-07T20:32:31.1758882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.1759684Z x = x_sign * x_clamp 2025-05-07T20:32:31.1759947Z x0 = x[:, :D] 2025-05-07T20:32:31.1760171Z x1 = x[:, D:] 2025-05-07T20:32:31.1760382Z 2025-05-07T20:32:31.1760584Z if contiguous: 2025-05-07T20:32:31.1760830Z x0 = x0.contiguous() 2025-05-07T20:32:31.1761107Z x1 = x1.contiguous() 2025-05-07T20:32:31.1761349Z 2025-05-07T20:32:31.1761554Z if scale_ub is not None: 2025-05-07T20:32:31.1761837Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.1762175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.1762656Z ) 2025-05-07T20:32:31.1762864Z else: 2025-05-07T20:32:31.1763238Z scale_ub_tensor = None 2025-05-07T20:32:31.1763501Z 2025-05-07T20:32:31.1763752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.1764070Z op = silu_mul_quant 2025-05-07T20:32:31.1764327Z if compiled: 2025-05-07T20:32:31.1764591Z op = torch.compile(op) 2025-05-07T20:32:31.1764930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1765223Z 2025-05-07T20:32:31.1765504Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.1765670Z 2025-05-07T20:32:31.1765771Z moe/activation_test.py:117: 2025-05-07T20:32:31.1766088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1766430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.1766723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1767418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.1768194Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.1768796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.1780010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.1780703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.1781251Z kernel = self.compile( 2025-05-07T20:32:31.1781794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.1782450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.1782853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1783085Z 2025-05-07T20:32:31.1783292Z self = 2025-05-07T20:32:31.1784375Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.1785746Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201f380>} 2025-05-07T20:32:31.1787079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.1788097Z context = 2025-05-07T20:32:31.1788383Z 2025-05-07T20:32:31.1788561Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.1789077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.1789553Z module_map=module_map) 2025-05-07T20:32:31.1789930Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.1790282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.1790550Z E ^ 2025-05-07T20:32:31.1791021Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.1791469Z 2025-05-07T20:32:31.1791887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.1792395Z 2025-05-07T20:32:31.1792500Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1792916Z self=, 2025-05-07T20:32:31.1793320Z T=1, 2025-05-07T20:32:31.1793601Z D=7168, 2025-05-07T20:32:31.1793804Z scale_ub=1200.0, 2025-05-07T20:32:31.1794036Z contiguous=True, 2025-05-07T20:32:31.1794259Z compiled=True, 2025-05-07T20:32:31.1794616Z ) 2025-05-07T20:32:31.1794944Z self = 2025-05-07T20:32:31.1795429Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:31.1822354Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.1822715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.1822981Z E ^ 2025-05-07T20:32:31.1823441Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.1823890Z 2025-05-07T20:32:31.1824359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
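Every example Hypothesis tries fails the same way: Triton rejects the fp8e4nv (e4m3) element type while lowering _fbgemm_silu_mul_quant, regardless of T, D, scale_ub, contiguous, or compiled. The error names the only fp8 encodings this GPU exposes, fp8e4b15 and fp8e5, which points at the hardware rather than the test. A minimal sketch of a capability guard that would skip these tests on such runners; the helper name supports_fp8e4nv is hypothetical, and treating compute capability (8, 9) as the e4m3 cutoff is an assumption consistent with the A10G (8, 6) in a g5.4xlarge hitting this error:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; Triton only lowers it on
        # NVIDIA GPUs with compute capability >= (8, 9) (Ada/Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical guard on the failing test:
    #   @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    #   def test_silu_mul_quant(self, ...) -> None: ...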
2025-05-07T20:32:31.1824874Z 2025-05-07T20:32:31.1824982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1825409Z self=, 2025-05-07T20:32:31.1825806Z T=1, 2025-05-07T20:32:31.1826000Z D=7168, 2025-05-07T20:32:31.1826203Z scale_ub=1200.0, 2025-05-07T20:32:31.1826428Z contiguous=False, 2025-05-07T20:32:31.1826664Z compiled=True, 2025-05-07T20:32:31.1826881Z ) 2025-05-07T20:32:31.3144852Z self = 2025-05-07T20:32:31.3145400Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.3173413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.3173777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.3174044Z E ^ 2025-05-07T20:32:31.3174508Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.3175027Z 2025-05-07T20:32:31.3175442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
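For orientation, the math the fused _fbgemm_silu_mul_quant kernel feeds into quantization is spelled out by the test's reference path (ref_fn, shown in the next example below): upcast both halves to fp32, apply SiLU to the first, and multiply by the second. An eager-mode sketch of just that step, as illustration rather than the FBGEMM kernel itself:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Mirrors ref_fn in moe/activation_test.py: SiLU(x0) * x1,
        # computed in fp32, i.e. x0 * sigmoid(x0) * x1.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32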
2025-05-07T20:32:31.3175947Z 2025-05-07T20:32:31.3176064Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.3176487Z self=, 2025-05-07T20:32:31.3176889Z T=1, 2025-05-07T20:32:31.3177089Z D=7168, 2025-05-07T20:32:31.3177295Z scale_ub=None, 2025-05-07T20:32:31.3177517Z contiguous=False, 2025-05-07T20:32:31.3177751Z compiled=True, 2025-05-07T20:32:31.3177970Z ) 2025-05-07T20:32:31.5845861Z self = 2025-05-07T20:32:31.5846915Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:31.5862785Z y_fp8, y_scale = fn() 2025-05-07T20:32:31.5863066Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:31.5863359Z 2025-05-07T20:32:31.5863752Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.5864092Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:31.5864384Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:31.5864698Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:31.5865061Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.5865367Z 2025-05-07T20:32:31.5865668Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:31.5865862Z 2025-05-07T20:32:31.5865970Z moe/activation_test.py:126: 2025-05-07T20:32:31.5866266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5866602Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:31.5866935Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:31.5867739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:31.5868485Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:31.5869034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:31.5869719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.5870399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:31.5871120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:31.5871843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:31.5872478Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:31.5873076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:31.5873586Z fn() 2025-05-07T20:32:31.5874099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:31.5874672Z self.fn.run( 2025-05-07T20:32:31.5875139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.5875668Z kernel = self.compile( 2025-05-07T20:32:31.5876218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.5876870Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.5877275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.5877510Z 2025-05-07T20:32:31.5877723Z self = 2025-05-07T20:32:31.5878799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.5880168Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0173c6de0>} 2025-05-07T20:32:31.5881494Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.5882519Z context = 2025-05-07T20:32:31.5882807Z 2025-05-07T20:32:31.5882984Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.5883511Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.5884049Z module_map=module_map) 2025-05-07T20:32:31.5884560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.5884933Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:31.5885204Z E ^ 2025-05-07T20:32:31.5885671Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.5886119Z 2025-05-07T20:32:31.5886538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
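This example (scale_ub=None, compiled=True) gets one step further: fn() returns, and the failure moves into the reference path, where triton_quantize_fp8_row launches _kernel_quantize_fp8_row and trips over the same fp8e4nv lowering. A back-of-the-envelope sketch of what row-wise fp8 quantization computes, assuming e4m3 with a finite max of 448; the scale_ub clamp and the epsilon are assumptions, not the real kernel's exact behavior:

    from typing import Optional, Tuple
    import torch

    FP8_E4M3_MAX = 448.0  # largest finite torch.float8_e4m3fn value

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, chosen so the row's max magnitude fills the fp8 range.
        row_amax = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_amax = torch.minimum(row_amax, scale_ub)
        scale = row_amax.clamp(min=1e-12) / FP8_E4M3_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        # Dequantize the way the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale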
2025-05-07T20:32:31.5887086Z 2025-05-07T20:32:31.5887204Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.5887621Z self=, 2025-05-07T20:32:31.5888029Z T=1, 2025-05-07T20:32:31.5888223Z D=5120, 2025-05-07T20:32:31.5888421Z scale_ub=1200.0, 2025-05-07T20:32:31.5888661Z contiguous=False, 2025-05-07T20:32:31.5888894Z compiled=True, 2025-05-07T20:32:31.5889105Z ) 2025-05-07T20:32:31.7421386Z self = 2025-05-07T20:32:31.7421936Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.7450374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.7450834Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.7451178Z E ^ 2025-05-07T20:32:31.7451799Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.7452318Z 2025-05-07T20:32:31.7452772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.7453431Z 2025-05-07T20:32:31.7453598Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.7454212Z self=, 2025-05-07T20:32:31.7454687Z T=1, 2025-05-07T20:32:31.7454988Z D=5120, 2025-05-07T20:32:31.7455356Z scale_ub=1200.0, 2025-05-07T20:32:31.7465146Z contiguous=False, 2025-05-07T20:32:31.7465416Z compiled=False, 2025-05-07T20:32:31.7465694Z ) 2025-05-07T20:32:31.7466023Z self = 2025-05-07T20:32:31.7466514Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.7492149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.7492493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.7492777Z E ^ 2025-05-07T20:32:31.7493305Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.7493747Z 2025-05-07T20:32:31.7494162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.7494666Z 2025-05-07T20:32:31.7494784Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.7495231Z self=, 2025-05-07T20:32:31.7495628Z T=16384, 2025-05-07T20:32:31.7495828Z D=5120, 2025-05-07T20:32:31.7496028Z scale_ub=1200.0, 2025-05-07T20:32:31.7496257Z contiguous=False, 2025-05-07T20:32:31.7496487Z compiled=True, 2025-05-07T20:32:31.7496688Z ) 2025-05-07T20:32:31.8368242Z self = 2025-05-07T20:32:31.8368866Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.8396298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8396652Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.8396915Z E ^ 2025-05-07T20:32:31.8397384Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8397827Z 2025-05-07T20:32:31.8398245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
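Since the parameters clearly do not matter, the failure can be reproduced without Hypothesis at all. A standalone sketch for one of the tried examples (T=16384, D=5120, scale_ub=1200.0), using the silu_mul_quant import path visible in the traceback; on a GPU without fp8e4nv support this raises the same CompilationError at kernel launch, with or without torch.compile:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([16384, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)  # CompilationError on SM < 8.9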
2025-05-07T20:32:31.8398801Z 2025-05-07T20:32:31.8398907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.8399439Z self=, 2025-05-07T20:32:31.8399840Z T=2048, 2025-05-07T20:32:31.8400027Z D=7168, 2025-05-07T20:32:31.8400227Z scale_ub=1200.0, 2025-05-07T20:32:31.8400457Z contiguous=False, 2025-05-07T20:32:31.8400688Z compiled=True, 2025-05-07T20:32:31.8400911Z ) 2025-05-07T20:32:31.8401236Z self = 2025-05-07T20:32:31.8401781Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:31.8428606Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.8428976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.8429241Z E ^ 2025-05-07T20:32:31.8429705Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.8430159Z 2025-05-07T20:32:31.8430572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.8431079Z 2025-05-07T20:32:31.9594944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.9596251Z self=, 2025-05-07T20:32:31.9597344Z T=1, 2025-05-07T20:32:31.9597863Z D=5120, 2025-05-07T20:32:31.9598260Z scale_ub=None, 2025-05-07T20:32:31.9598698Z contiguous=False, 2025-05-07T20:32:31.9599155Z compiled=False, 2025-05-07T20:32:31.9599600Z ) 2025-05-07T20:32:31.9600249Z self = 2025-05-07T20:32:31.9601227Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:31.9629254Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.9629612Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.9629866Z E ^ 2025-05-07T20:32:31.9630373Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.9630826Z 2025-05-07T20:32:31.9631238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.9631743Z 2025-05-07T20:32:31.9631857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.9632269Z self=, 2025-05-07T20:32:31.9632678Z T=4096, 2025-05-07T20:32:31.9632882Z D=7168, 2025-05-07T20:32:31.9633081Z scale_ub=1200.0, 2025-05-07T20:32:31.9633314Z contiguous=False, 2025-05-07T20:32:31.9633546Z compiled=False, 2025-05-07T20:32:31.9633747Z ) 2025-05-07T20:32:31.9634069Z self = 2025-05-07T20:32:31.9634561Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:31.9670349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.9670705Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.9670963Z E ^ 2025-05-07T20:32:31.9671423Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.9671869Z 2025-05-07T20:32:31.9672290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.9672794Z 2025-05-07T20:32:31.9672906Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.9673318Z self=, 2025-05-07T20:32:31.9673726Z T=16384, 2025-05-07T20:32:31.9673926Z D=7168, 2025-05-07T20:32:31.9674118Z scale_ub=None, 2025-05-07T20:32:31.9674343Z contiguous=True, 2025-05-07T20:32:31.9674610Z compiled=True, 2025-05-07T20:32:31.9674825Z ) 2025-05-07T20:32:32.1432023Z self = 2025-05-07T20:32:32.1432779Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:32.1461229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:32.1461588Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:32.1461854Z E ^ 2025-05-07T20:32:32.1462316Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1462768Z 2025-05-07T20:32:32.1463184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
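The error message also hints at the fallback surface: of the two encodings this GPU does expose, fp8e5 corresponds to torch.float8_e5m2. A hedged sketch of capability-based dtype selection; whether the surrounding kernels accept e5m2 is not established by this log, so this is an assumption, not an FBGEMM recipe:

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # Prefer e4m3 (Triton's fp8e4nv) where the GPU supports it,
        # otherwise fall back to e5m2 (Triton's fp8e5).
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2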
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1462768Z 2025-05-07T20:32:32.1463184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:32.1463807Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError as above (identical test source and traceback; only the drawn parameters differ): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
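For reference, silu_mul_quant fuses a SiLU gate, an elementwise multiply, and FP8 quantization. Below is a minimal eager-mode sketch of the math being tested, assuming rowwise dynamic scaling against the float8_e4m3fn max normal value of 448.0; the scaling scheme and the name silu_mul_quant_ref are illustrative assumptions, not FBGEMM's actual implementation.

```python
# Minimal sketch of the op under test (assumptions: rowwise dynamic scaling,
# float8_e4m3fn output with max normal value 448.0; the real FBGEMM Triton
# kernel _fbgemm_silu_mul_quant may scale differently).
from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
    fp8_max: float = 448.0,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # y = silu(x0) * x1, computed in fp32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise absmax, optionally clamped from above by scale_ub.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.float())
    y_scale = row_max / fp8_max
    # The cast itself is plain PyTorch; it is the Triton kernel's fp8e4nv
    # codegen, not this cast, that the log shows failing on this GPU.
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)
```

The Triton kernel never reaches any of this math here: compilation is rejected as soon as the fp8e4nv output type is encountered.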
2025-05-07T20:32:32.2971408Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.3003187Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.4186856Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.4229099Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:32.4260462Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.6696808Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.6727546Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.7644353Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:32.9255156Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:32.9286884Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.9317843Z 2025-05-07T20:32:32.9318251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:32.9318802Z 2025-05-07T20:32:33.1025812Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.1026292Z self=, 2025-05-07T20:32:33.1026703Z T=16384, 2025-05-07T20:32:33.1026906Z D=5120, 2025-05-07T20:32:33.1027118Z scale_ub=None, 2025-05-07T20:32:33.1027341Z contiguous=False, 2025-05-07T20:32:33.1027575Z compiled=True, 2025-05-07T20:32:33.1027781Z ) 2025-05-07T20:32:33.1028122Z self = 2025-05-07T20:32:33.1028626Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:33.1028901Z 2025-05-07T20:32:33.1028989Z @given( 2025-05-07T20:32:33.1029215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.1029542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.1029867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.1030199Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.1030539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.1030835Z ) 2025-05-07T20:32:33.1031177Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.1031617Z def test_silu_mul_quant( 2025-05-07T20:32:33.1031872Z self, 2025-05-07T20:32:33.1032067Z T: int, 2025-05-07T20:32:33.1032273Z D: int, 2025-05-07T20:32:33.1032500Z scale_ub: Optional[float], 2025-05-07T20:32:33.1032774Z contiguous: bool, 2025-05-07T20:32:33.1033021Z compiled: bool, 2025-05-07T20:32:33.1033249Z ) -> None: 2025-05-07T20:32:33.1033467Z torch.manual_seed(2025) 2025-05-07T20:32:33.1033709Z 2025-05-07T20:32:33.1033989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.1034330Z 2025-05-07T20:32:33.1034524Z x_sign = torch.sign(x) 2025-05-07T20:32:33.1034817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.1035131Z x = x_sign * x_clamp 2025-05-07T20:32:33.1035370Z x0 = x[:, :D] 2025-05-07T20:32:33.1035595Z x1 = x[:, D:] 2025-05-07T20:32:33.1035808Z 2025-05-07T20:32:33.1035995Z if contiguous: 2025-05-07T20:32:33.1036238Z x0 = x0.contiguous() 2025-05-07T20:32:33.1036500Z x1 = x1.contiguous() 2025-05-07T20:32:33.1036741Z 2025-05-07T20:32:33.1036943Z if scale_ub is not None: 2025-05-07T20:32:33.1037221Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.1037549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.1037863Z ) 2025-05-07T20:32:33.1038065Z else: 2025-05-07T20:32:33.1038281Z scale_ub_tensor = None 2025-05-07T20:32:33.1038539Z 2025-05-07T20:32:33.1038786Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.1039101Z op = silu_mul_quant 2025-05-07T20:32:33.1039346Z if compiled: 2025-05-07T20:32:33.1039601Z op = torch.compile(op) 2025-05-07T20:32:33.1039901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1040171Z 2025-05-07T20:32:33.1040367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.1040531Z 2025-05-07T20:32:33.1040638Z moe/activation_test.py:117: 2025-05-07T20:32:33.1041182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.1041662Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.1041949Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.1042509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.1043059Z return fn(*args, **kwargs) 
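The failure is an architecture capability gap rather than a bad input: fp8e4nv is Triton's name for the FP8 E4M3 format the kernel writes, and Triton generally exposes fp8e4nv conversions only on GPUs of compute capability 8.9 or newer; on older parts it offers just fp8e4b15 and fp8e5, exactly the two dtypes the error lists. A minimal sketch of a capability guard that would let such tests skip cleanly on this hardware follows; the cuda_supports_fp8e4nv helper and the (8, 9) threshold are illustrative assumptions, not FBGEMM's actual gating.

    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (FP8 E4M3) needs SM 8.9+ (Ada/Hopper);
        # earlier architectures only get fp8e4b15/fp8e5, matching this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    class Fp8GuardSketch(unittest.TestCase):
        # Hypothetical guard, applied the way a test like test_silu_mul_quant could be.
        @unittest.skipIf(
            not cuda_supports_fp8e4nv(),
            "fp8e4nv not supported on this architecture; skipping FP8 E4M3 test",
        )
        def test_requires_fp8e4nv(self) -> None:
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))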
Every subsequent Hypothesis example failed the same way: Triton raised CompilationError at 1:0 of def _fbgemm_silu_mul_quant( with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") from triton/compiler/compiler.py:100, through the identical stack shown above (for the compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent, since torch.compile is skipped). The examples tried:
2025-05-07T20:32:33.1025812Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.1058284Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.2003971Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.3784882Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:33.3817294Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.6648471Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:33.7908402Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:32:33.7940279Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:32:33.9719831Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:33.9753727Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:34.0706415Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.0736942Z 2025-05-07T20:32:34.0737423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.0737925Z 2025-05-07T20:32:34.1333681Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1334131Z self=, 2025-05-07T20:32:34.1334566Z T=16384, 2025-05-07T20:32:34.1334824Z D=5120, 2025-05-07T20:32:34.1335093Z scale_ub=None, 2025-05-07T20:32:34.1335339Z contiguous=False, 2025-05-07T20:32:34.1335558Z compiled=False, 2025-05-07T20:32:34.1335891Z ) 2025-05-07T20:32:34.1336209Z self = 2025-05-07T20:32:34.1336693Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.1336975Z 2025-05-07T20:32:34.1337056Z @given( 2025-05-07T20:32:34.1337287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1337603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1337902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1338246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1338571Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1338850Z ) 2025-05-07T20:32:34.1339205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1339647Z def test_silu_mul_quant( 2025-05-07T20:32:34.1339881Z self, 2025-05-07T20:32:34.1340073Z T: int, 2025-05-07T20:32:34.1340275Z D: int, 2025-05-07T20:32:34.1340488Z scale_ub: Optional[float], 2025-05-07T20:32:34.1340753Z contiguous: bool, 2025-05-07T20:32:34.1340993Z compiled: bool, 2025-05-07T20:32:34.1341215Z ) -> None: 2025-05-07T20:32:34.1341430Z torch.manual_seed(2025) 2025-05-07T20:32:34.1341672Z 2025-05-07T20:32:34.1341945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1342285Z 2025-05-07T20:32:34.1342483Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1342774Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1344760Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1346605Z 2025-05-07T20:32:34.1346728Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.1346935Z 2025-05-07T20:32:34.1347036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1347447Z self=, 2025-05-07T20:32:34.1347839Z T=4096, 2025-05-07T20:32:34.1348021Z D=7168, 2025-05-07T20:32:34.1348210Z scale_ub=1200.0, 2025-05-07T20:32:34.1348427Z contiguous=True, 2025-05-07T20:32:34.1348668Z compiled=True, 2025-05-07T20:32:34.1348862Z ) 2025-05-07T20:32:34.1349177Z self = 2025-05-07T20:32:34.1349661Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.1349928Z 2025-05-07T20:32:34.1350014Z @given( 2025-05-07T20:32:34.1350237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1350547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1350857Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1351180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1351502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1351850Z ) 2025-05-07T20:32:34.1352185Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1352727Z def test_silu_mul_quant( 2025-05-07T20:32:34.1352972Z self, 2025-05-07T20:32:34.1353167Z T: int, 2025-05-07T20:32:34.1353362Z D: int, 2025-05-07T20:32:34.1353587Z scale_ub: Optional[float], 2025-05-07T20:32:34.1353850Z contiguous: bool, 2025-05-07T20:32:34.1354086Z compiled: bool, 2025-05-07T20:32:34.1354312Z ) -> None: 2025-05-07T20:32:34.1354527Z torch.manual_seed(2025) 2025-05-07T20:32:34.1354809Z 2025-05-07T20:32:34.1355075Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1355457Z 2025-05-07T20:32:34.1355644Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1355930Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1357907Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1359992Z 2025-05-07T20:32:34.1360119Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.1360330Z 2025-05-07T20:32:34.1360438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1360836Z self=, 2025-05-07T20:32:34.1361229Z T=16384, 2025-05-07T20:32:34.1361423Z D=7168, 2025-05-07T20:32:34.1361607Z scale_ub=None, 2025-05-07T20:32:34.1361817Z contiguous=False, 2025-05-07T20:32:34.1362040Z compiled=False, 2025-05-07T20:32:34.1362240Z ) 2025-05-07T20:32:34.1362549Z self = 2025-05-07T20:32:34.1363044Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.1363316Z 2025-05-07T20:32:34.1363394Z @given( 2025-05-07T20:32:34.1363629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1363934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1364235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1364555Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1364878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1365156Z ) 2025-05-07T20:32:34.1365546Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1365988Z def test_silu_mul_quant( 2025-05-07T20:32:34.1366224Z self, 2025-05-07T20:32:34.1366426Z T: int, 2025-05-07T20:32:34.1366619Z D: int, 2025-05-07T20:32:34.1366832Z scale_ub: Optional[float], 2025-05-07T20:32:34.1367106Z contiguous: bool, 2025-05-07T20:32:34.1367354Z compiled: bool, 2025-05-07T20:32:34.1367569Z ) -> None: 2025-05-07T20:32:34.1367779Z torch.manual_seed(2025) 2025-05-07T20:32:34.1368014Z 2025-05-07T20:32:34.1368279Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1370297Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1372219Z 2025-05-07T20:32:34.1372335Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.1372679Z 2025-05-07T20:32:34.1372786Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1373276Z self=, 2025-05-07T20:32:34.1373671Z T=2048, 2025-05-07T20:32:34.1373857Z D=7168, 2025-05-07T20:32:34.1374045Z scale_ub=1200.0, 2025-05-07T20:32:34.1374259Z contiguous=True, 2025-05-07T20:32:34.1374481Z compiled=True, 2025-05-07T20:32:34.1374750Z ) 2025-05-07T20:32:34.1375089Z self = 2025-05-07T20:32:34.1375596Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.1375867Z 2025-05-07T20:32:34.1375941Z @given( 2025-05-07T20:32:34.1376167Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.1376467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.1376775Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.1377100Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.1377417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.1377697Z ) 2025-05-07T20:32:34.1378043Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.1378470Z def test_silu_mul_quant( 2025-05-07T20:32:34.1378707Z self, 2025-05-07T20:32:34.1378895Z T: int, 2025-05-07T20:32:34.1379096Z D: int, 2025-05-07T20:32:34.1379309Z scale_ub: Optional[float], 2025-05-07T20:32:34.1379575Z contiguous: bool, 2025-05-07T20:32:34.1379810Z compiled: bool, 2025-05-07T20:32:34.1380024Z ) -> None: 2025-05-07T20:32:34.1380239Z torch.manual_seed(2025) 2025-05-07T20:32:34.1380477Z 2025-05-07T20:32:34.1380736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.1381070Z 2025-05-07T20:32:34.1381263Z x_sign = torch.sign(x) 2025-05-07T20:32:34.1381546Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.1383512Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.1385331Z 2025-05-07T20:32:34.1385446Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.1385656Z 2025-05-07T20:32:34.1385758Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.1386160Z self=, 2025-05-07T20:32:34.1386549Z T=2048, 2025-05-07T20:32:34.1386740Z D=7168, 2025-05-07T20:32:34.1386929Z scale_ub=None, 2025-05-07T20:32:34.1387134Z contiguous=True, 2025-05-07T20:32:34.1387349Z compiled=False, 2025-05-07T20:32:34.1387551Z ) 2025-05-07T20:32:34.2520637Z self = 2025-05-07T20:32:34.2522016Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.2522585Z 2025-05-07T20:32:34.2522737Z @given( 2025-05-07T20:32:34.2523198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2523808Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2524394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2525032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2525413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2525826Z ) 2025-05-07T20:32:34.2526167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2526712Z def test_silu_mul_quant( 2025-05-07T20:32:34.2526945Z self, 2025-05-07T20:32:34.2527141Z T: int, 2025-05-07T20:32:34.2527350Z D: int, 2025-05-07T20:32:34.2527558Z scale_ub: Optional[float], 2025-05-07T20:32:34.2527838Z contiguous: bool, 2025-05-07T20:32:34.2528083Z compiled: bool, 2025-05-07T20:32:34.2528300Z ) -> None: 2025-05-07T20:32:34.2528515Z torch.manual_seed(2025) 2025-05-07T20:32:34.2528826Z 2025-05-07T20:32:34.2529094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2529422Z 2025-05-07T20:32:34.2529616Z > x_sign = torch.sign(x) 2025-05-07T20:32:34.2531563Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.2533460Z 2025-05-07T20:32:34.2533578Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.2533786Z 2025-05-07T20:32:34.2533885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2534293Z self=, 2025-05-07T20:32:34.2534682Z T=1, 2025-05-07T20:32:34.2534862Z D=7168, 2025-05-07T20:32:34.2535050Z scale_ub=1200.0, 2025-05-07T20:32:34.2535266Z contiguous=True, 2025-05-07T20:32:34.2535490Z compiled=False, 2025-05-07T20:32:34.2535686Z ) 2025-05-07T20:32:34.2535997Z self = 2025-05-07T20:32:34.2536473Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.2536735Z 2025-05-07T20:32:34.2536814Z @given( 2025-05-07T20:32:34.2537037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.2537340Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.2537638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.2537965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.2538288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.2538566Z ) 2025-05-07T20:32:34.2538910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.2539344Z def test_silu_mul_quant( 2025-05-07T20:32:34.2539589Z self, 2025-05-07T20:32:34.2539776Z T: int, 2025-05-07T20:32:34.2539970Z D: int, 2025-05-07T20:32:34.2540184Z scale_ub: Optional[float], 2025-05-07T20:32:34.2540447Z contiguous: bool, 2025-05-07T20:32:34.2540691Z compiled: bool, 2025-05-07T20:32:34.2540916Z ) -> None: 2025-05-07T20:32:34.2541127Z torch.manual_seed(2025) 2025-05-07T20:32:34.2541369Z 2025-05-07T20:32:34.2541635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.2541963Z 2025-05-07T20:32:34.2542157Z x_sign = torch.sign(x) 2025-05-07T20:32:34.2542446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.2542751Z x = x_sign * x_clamp 2025-05-07T20:32:34.2542988Z x0 = x[:, :D] 2025-05-07T20:32:34.2543206Z x1 = x[:, D:] 2025-05-07T20:32:34.2543410Z 2025-05-07T20:32:34.2543590Z if contiguous: 2025-05-07T20:32:34.2543818Z x0 = x0.contiguous() 2025-05-07T20:32:34.2544072Z x1 = x1.contiguous() 2025-05-07T20:32:34.2544303Z 2025-05-07T20:32:34.2544495Z if scale_ub is not None: 2025-05-07T20:32:34.2544770Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.2545180Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.2545495Z ) 2025-05-07T20:32:34.2545759Z else: 2025-05-07T20:32:34.2545967Z scale_ub_tensor = None 2025-05-07T20:32:34.2546224Z 2025-05-07T20:32:34.2546459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.2546767Z op = silu_mul_quant 2025-05-07T20:32:34.2547011Z if compiled: 2025-05-07T20:32:34.2547258Z op = torch.compile(op) 2025-05-07T20:32:34.2547544Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2547865Z 2025-05-07T20:32:34.2548069Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.2548235Z 2025-05-07T20:32:34.2548336Z moe/activation_test.py:117: 2025-05-07T20:32:34.2548624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2548945Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.2549225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.2549906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.2550583Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.2551108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.2551772Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.2552425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.2552952Z kernel = self.compile( 2025-05-07T20:32:34.2553485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.2554122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.2554517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.2554754Z 2025-05-07T20:32:34.2554965Z self = 2025-05-07T20:32:34.2556077Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.2557421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016cd2b60>} 2025-05-07T20:32:34.2558743Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.2559936Z context = 2025-05-07T20:32:34.2560222Z 2025-05-07T20:32:34.2560404Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.2560915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.2561376Z module_map=module_map) 2025-05-07T20:32:34.2561744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.2562100Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.2562349Z E ^ 2025-05-07T20:32:34.2562807Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.2563258Z 2025-05-07T20:32:34.2563666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.2564166Z 2025-05-07T20:32:34.2564271Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.2564670Z self=, 2025-05-07T20:32:34.2565184Z T=128, 2025-05-07T20:32:34.2565372Z D=5120, 2025-05-07T20:32:34.2565580Z scale_ub=None, 2025-05-07T20:32:34.2565930Z contiguous=True, 2025-05-07T20:32:34.2566151Z compiled=False, 2025-05-07T20:32:34.2566349Z ) 2025-05-07T20:32:34.3243524Z self = 2025-05-07T20:32:34.3244065Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.3244336Z 2025-05-07T20:32:34.3244416Z @given( 2025-05-07T20:32:34.3244644Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.3245068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.3245371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.3245699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.3246020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.3246298Z ) 2025-05-07T20:32:34.3246642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.3247082Z def test_silu_mul_quant( 2025-05-07T20:32:34.3247321Z self, 2025-05-07T20:32:34.3247513Z T: int, 2025-05-07T20:32:34.3247703Z D: int, 2025-05-07T20:32:34.3247914Z scale_ub: Optional[float], 2025-05-07T20:32:34.3248188Z contiguous: bool, 2025-05-07T20:32:34.3248424Z compiled: bool, 2025-05-07T20:32:34.3248646Z ) -> None: 2025-05-07T20:32:34.3248859Z torch.manual_seed(2025) 2025-05-07T20:32:34.3249100Z 2025-05-07T20:32:34.3249370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.3249702Z 2025-05-07T20:32:34.3249892Z x_sign = torch.sign(x) 2025-05-07T20:32:34.3250174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.3250470Z x = x_sign * x_clamp 2025-05-07T20:32:34.3250707Z x0 = x[:, :D] 2025-05-07T20:32:34.3250923Z x1 = x[:, D:] 2025-05-07T20:32:34.3251123Z 2025-05-07T20:32:34.3251314Z if contiguous: 2025-05-07T20:32:34.3251546Z x0 = x0.contiguous() 2025-05-07T20:32:34.3251801Z x1 = x1.contiguous() 2025-05-07T20:32:34.3252042Z 2025-05-07T20:32:34.3252227Z if scale_ub is not None: 2025-05-07T20:32:34.3252490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.3252817Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.3253195Z ) 2025-05-07T20:32:34.3261533Z else: 2025-05-07T20:32:34.3261784Z scale_ub_tensor = None 2025-05-07T20:32:34.3262047Z 2025-05-07T20:32:34.3262285Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.3262600Z op = silu_mul_quant 2025-05-07T20:32:34.3262843Z if compiled: 2025-05-07T20:32:34.3263091Z op = torch.compile(op) 2025-05-07T20:32:34.3263385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3263645Z 2025-05-07T20:32:34.3263832Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.3263992Z 2025-05-07T20:32:34.3264096Z moe/activation_test.py:117: 2025-05-07T20:32:34.3264384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3264708Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.3264985Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3265666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.3266335Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.3266863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.3267526Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.3268178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.3268810Z kernel = self.compile( 2025-05-07T20:32:34.3269461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.3270110Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.3270499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3270727Z 2025-05-07T20:32:34.3270935Z self = 2025-05-07T20:32:34.3271995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.3273399Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016cd3c40>} 2025-05-07T20:32:34.3274724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.3275767Z context = 2025-05-07T20:32:34.3276049Z 2025-05-07T20:32:34.3276209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.3276712Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.3277173Z module_map=module_map) 2025-05-07T20:32:34.3277526Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.3277869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.3278123Z E ^ 2025-05-07T20:32:34.3278572Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.3279019Z 2025-05-07T20:32:34.3279428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.3279942Z 2025-05-07T20:32:34.3280043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.3280449Z self=, 2025-05-07T20:32:34.3280840Z T=128, 2025-05-07T20:32:34.3281025Z D=7168, 2025-05-07T20:32:34.3281213Z scale_ub=None, 2025-05-07T20:32:34.3281413Z contiguous=True, 2025-05-07T20:32:34.3281630Z compiled=False, 2025-05-07T20:32:34.3281835Z ) 2025-05-07T20:32:34.3282146Z self = 2025-05-07T20:32:34.3282623Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.3282878Z 2025-05-07T20:32:34.3282959Z @given( 2025-05-07T20:32:34.3283178Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.3283482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.3283779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.3284103Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.3284422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.3284700Z ) 2025-05-07T20:32:34.3285041Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.3285471Z def test_silu_mul_quant( 2025-05-07T20:32:34.3285712Z self, 2025-05-07T20:32:34.3285902Z T: int, 2025-05-07T20:32:34.3286088Z D: int, 2025-05-07T20:32:34.3286303Z scale_ub: Optional[float], 2025-05-07T20:32:34.3286572Z contiguous: bool, 2025-05-07T20:32:34.3286804Z compiled: bool, 2025-05-07T20:32:34.3287019Z ) -> None: 2025-05-07T20:32:34.3287226Z torch.manual_seed(2025) 2025-05-07T20:32:34.3287460Z 2025-05-07T20:32:34.3287723Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.3288107Z 2025-05-07T20:32:34.3288298Z x_sign = torch.sign(x) 2025-05-07T20:32:34.3288652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.3288960Z x = x_sign * x_clamp 2025-05-07T20:32:34.3289194Z x0 = x[:, :D] 2025-05-07T20:32:34.3289402Z x1 = x[:, D:] 2025-05-07T20:32:34.3289609Z 2025-05-07T20:32:34.3289798Z if contiguous: 2025-05-07T20:32:34.3290024Z x0 = x0.contiguous() 2025-05-07T20:32:34.3290283Z x1 = x1.contiguous() 2025-05-07T20:32:34.3290519Z 2025-05-07T20:32:34.3290803Z if scale_ub is not None: 2025-05-07T20:32:34.3291072Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.3291401Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.3291701Z ) 2025-05-07T20:32:34.3291898Z else: 2025-05-07T20:32:34.3292114Z scale_ub_tensor = None 2025-05-07T20:32:34.3292365Z 2025-05-07T20:32:34.3292602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.3292909Z op = silu_mul_quant 2025-05-07T20:32:34.3293210Z if compiled: 2025-05-07T20:32:34.3293457Z op = torch.compile(op) 2025-05-07T20:32:34.3293748Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3294016Z 2025-05-07T20:32:34.3294202Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.3294366Z 2025-05-07T20:32:34.3294464Z moe/activation_test.py:117: 2025-05-07T20:32:34.3294757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3295084Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.3295360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.3296036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.3296719Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.3297247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.3297922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.3298577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.3299092Z kernel = self.compile( 2025-05-07T20:32:34.3299622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.3300261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.3300656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.3300878Z 2025-05-07T20:32:34.3301083Z self = 2025-05-07T20:32:34.3302140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.3303487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016b70ae0>} 2025-05-07T20:32:34.3304798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.3305810Z context = 2025-05-07T20:32:34.3306095Z 2025-05-07T20:32:34.3306262Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.3306775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.3307238Z module_map=module_map) 2025-05-07T20:32:34.3307647Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.3307997Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.3308255Z E ^ 2025-05-07T20:32:34.3308786Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.3309226Z 2025-05-07T20:32:34.3309637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.3310139Z 2025-05-07T20:32:34.3310242Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.3310691Z self=, 2025-05-07T20:32:34.3311089Z T=2048, 2025-05-07T20:32:34.3311271Z D=7168, 2025-05-07T20:32:34.3311463Z scale_ub=1200.0, 2025-05-07T20:32:34.3311691Z contiguous=True, 2025-05-07T20:32:34.3311908Z compiled=False, 2025-05-07T20:32:34.3312116Z ) 2025-05-07T20:32:34.4119917Z self = 2025-05-07T20:32:34.4120981Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.4121525Z 2025-05-07T20:32:34.4121682Z @given( 2025-05-07T20:32:34.4122127Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4122746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4123341Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4123995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4124633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4125167Z ) 2025-05-07T20:32:34.4125507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4125940Z def test_silu_mul_quant( 2025-05-07T20:32:34.4126181Z self, 2025-05-07T20:32:34.4126376Z T: int, 2025-05-07T20:32:34.4126566Z D: int, 2025-05-07T20:32:34.4126782Z scale_ub: Optional[float], 2025-05-07T20:32:34.4127051Z contiguous: bool, 2025-05-07T20:32:34.4127284Z compiled: bool, 2025-05-07T20:32:34.4127507Z ) -> None: 2025-05-07T20:32:34.4127724Z torch.manual_seed(2025) 2025-05-07T20:32:34.4127956Z 2025-05-07T20:32:34.4128227Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4130254Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4132073Z 2025-05-07T20:32:34.4132194Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4132402Z 2025-05-07T20:32:34.4132510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4132917Z self=, 2025-05-07T20:32:34.4133365Z T=1, 2025-05-07T20:32:34.4133544Z D=5120, 2025-05-07T20:32:34.4133724Z scale_ub=1200.0, 2025-05-07T20:32:34.4133939Z contiguous=True, 2025-05-07T20:32:34.4134167Z compiled=False, 2025-05-07T20:32:34.4134364Z ) 2025-05-07T20:32:34.4134675Z self = 2025-05-07T20:32:34.4135162Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.4135418Z 2025-05-07T20:32:34.4135494Z @given( 2025-05-07T20:32:34.4135721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4136024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4136332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4136763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4137085Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4137498Z ) 2025-05-07T20:32:34.4137839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4138277Z def test_silu_mul_quant( 2025-05-07T20:32:34.4138514Z self, 2025-05-07T20:32:34.4138709Z T: int, 2025-05-07T20:32:34.4138906Z D: int, 2025-05-07T20:32:34.4139116Z scale_ub: Optional[float], 2025-05-07T20:32:34.4139379Z contiguous: bool, 2025-05-07T20:32:34.4139687Z compiled: bool, 2025-05-07T20:32:34.4139906Z ) -> None: 2025-05-07T20:32:34.4140112Z torch.manual_seed(2025) 2025-05-07T20:32:34.4140347Z 2025-05-07T20:32:34.4140614Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4140942Z 2025-05-07T20:32:34.4141135Z x_sign = torch.sign(x) 2025-05-07T20:32:34.4141424Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.4141729Z x = x_sign * x_clamp 2025-05-07T20:32:34.4141969Z x0 = x[:, :D] 2025-05-07T20:32:34.4142191Z x1 = x[:, D:] 2025-05-07T20:32:34.4142397Z 2025-05-07T20:32:34.4142589Z if contiguous: 2025-05-07T20:32:34.4142822Z x0 = x0.contiguous() 2025-05-07T20:32:34.4143076Z x1 = x1.contiguous() 2025-05-07T20:32:34.4143310Z 2025-05-07T20:32:34.4143504Z if scale_ub is not None: 2025-05-07T20:32:34.4143777Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.4144108Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.4144418Z ) 2025-05-07T20:32:34.4144606Z else: 2025-05-07T20:32:34.4144812Z scale_ub_tensor = None 2025-05-07T20:32:34.4145070Z 2025-05-07T20:32:34.4145297Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.4145606Z op = silu_mul_quant 2025-05-07T20:32:34.4145855Z if compiled: 2025-05-07T20:32:34.4146107Z op = torch.compile(op) 2025-05-07T20:32:34.4146396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4146662Z 2025-05-07T20:32:34.4146854Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.4147013Z 2025-05-07T20:32:34.4147110Z moe/activation_test.py:117: 2025-05-07T20:32:34.4147401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.4147724Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.4148001Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.4148679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.4149356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.4149880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.4150541Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.4151199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.4151721Z kernel = self.compile( 2025-05-07T20:32:34.4152254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.4152887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.4153275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.4153507Z 2025-05-07T20:32:34.4153713Z self = 2025-05-07T20:32:34.4154776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.4156238Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016b720c0>} 2025-05-07T20:32:34.4157560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.4158563Z context = 2025-05-07T20:32:34.4158844Z 2025-05-07T20:32:34.4159011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.4159722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.4160181Z module_map=module_map) 2025-05-07T20:32:34.4160554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.4160904Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.4161166Z E ^ 2025-05-07T20:32:34.4161633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.4162077Z 2025-05-07T20:32:34.4162490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.4162987Z 2025-05-07T20:32:34.4163094Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4163498Z self=, 2025-05-07T20:32:34.4163892Z T=2048, 2025-05-07T20:32:34.4164083Z D=5120, 2025-05-07T20:32:34.4164267Z scale_ub=None, 2025-05-07T20:32:34.4164478Z contiguous=True, 2025-05-07T20:32:34.4164697Z compiled=False, 2025-05-07T20:32:34.4164892Z ) 2025-05-07T20:32:34.4165210Z self = 2025-05-07T20:32:34.4165691Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.4165959Z 2025-05-07T20:32:34.4166037Z @given( 2025-05-07T20:32:34.4166272Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4166579Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4166883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4167203Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4167533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4167810Z ) 2025-05-07T20:32:34.4168150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4168581Z def test_silu_mul_quant( 2025-05-07T20:32:34.4168816Z self, 2025-05-07T20:32:34.4168999Z T: int, 2025-05-07T20:32:34.4169193Z D: int, 2025-05-07T20:32:34.4169403Z scale_ub: Optional[float], 2025-05-07T20:32:34.4169663Z contiguous: bool, 2025-05-07T20:32:34.4169899Z compiled: bool, 2025-05-07T20:32:34.4170118Z ) -> None: 2025-05-07T20:32:34.4170324Z torch.manual_seed(2025) 2025-05-07T20:32:34.4170557Z 2025-05-07T20:32:34.4170824Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4171158Z 2025-05-07T20:32:34.4171348Z > x_sign = torch.sign(x) 2025-05-07T20:32:34.4173323Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4175157Z 2025-05-07T20:32:34.4175271Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:34.4175555Z 2025-05-07T20:32:34.4175663Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4176176Z self=, 2025-05-07T20:32:34.4176576Z T=16384, 2025-05-07T20:32:34.4176774Z D=5120, 2025-05-07T20:32:34.4176971Z scale_ub=None, 2025-05-07T20:32:34.4177186Z contiguous=True, 2025-05-07T20:32:34.4177413Z compiled=False, 2025-05-07T20:32:34.4177622Z ) 2025-05-07T20:32:34.4936770Z self = 2025-05-07T20:32:34.4937330Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.4937735Z 2025-05-07T20:32:34.4937812Z @given( 2025-05-07T20:32:34.4938038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4938348Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4938644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4938971Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4939304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4939577Z ) 2025-05-07T20:32:34.4939929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4940361Z def test_silu_mul_quant( 2025-05-07T20:32:34.4940602Z self, 2025-05-07T20:32:34.4940789Z T: int, 2025-05-07T20:32:34.4940985Z D: int, 2025-05-07T20:32:34.4941199Z scale_ub: Optional[float], 2025-05-07T20:32:34.4941465Z contiguous: bool, 2025-05-07T20:32:34.4941704Z compiled: bool, 2025-05-07T20:32:34.4941929Z ) -> None: 2025-05-07T20:32:34.4942141Z torch.manual_seed(2025) 2025-05-07T20:32:34.4942378Z 2025-05-07T20:32:34.4942645Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4944665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4946554Z 2025-05-07T20:32:34.4946673Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4946885Z 2025-05-07T20:32:34.4946987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4947396Z self=, 2025-05-07T20:32:34.4947790Z T=4096, 2025-05-07T20:32:34.4947969Z D=5120, 2025-05-07T20:32:34.4948159Z scale_ub=None, 2025-05-07T20:32:34.4948364Z contiguous=True, 2025-05-07T20:32:34.4948578Z compiled=False, 2025-05-07T20:32:34.4948785Z ) 2025-05-07T20:32:34.4949100Z self = 2025-05-07T20:32:34.4949583Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.4949850Z 2025-05-07T20:32:34.4949926Z @given( 2025-05-07T20:32:34.4950151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4950458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4950751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4951071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4951394Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4951673Z ) 2025-05-07T20:32:34.4952010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4952444Z def test_silu_mul_quant( 2025-05-07T20:32:34.4952685Z self, 2025-05-07T20:32:34.4952875Z T: int, 2025-05-07T20:32:34.4953068Z D: int, 2025-05-07T20:32:34.4953347Z scale_ub: Optional[float], 2025-05-07T20:32:34.4953616Z contiguous: bool, 2025-05-07T20:32:34.4953853Z compiled: bool, 2025-05-07T20:32:34.4954178Z ) -> None: 2025-05-07T20:32:34.4954398Z torch.manual_seed(2025) 2025-05-07T20:32:34.4954637Z 2025-05-07T20:32:34.4954902Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4956954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4958816Z 2025-05-07T20:32:34.4958939Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4959149Z 2025-05-07T20:32:34.4959419Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4959831Z self=, 2025-05-07T20:32:34.4960223Z T=2048, 2025-05-07T20:32:34.4960407Z D=5120, 2025-05-07T20:32:34.4960591Z scale_ub=None, 2025-05-07T20:32:34.4960796Z contiguous=False, 2025-05-07T20:32:34.4961020Z compiled=False, 2025-05-07T20:32:34.4961220Z ) 2025-05-07T20:32:34.4961532Z self = 2025-05-07T20:32:34.4962019Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.4962286Z 2025-05-07T20:32:34.4962363Z @given( 2025-05-07T20:32:34.4962598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4962908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4963219Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4963552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4963875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4964154Z ) 2025-05-07T20:32:34.4964496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4964924Z def test_silu_mul_quant( 2025-05-07T20:32:34.4965159Z self, 2025-05-07T20:32:34.4965357Z T: int, 2025-05-07T20:32:34.4965583Z D: int, 2025-05-07T20:32:34.4965820Z scale_ub: Optional[float], 2025-05-07T20:32:34.4966092Z contiguous: bool, 2025-05-07T20:32:34.4966325Z compiled: bool, 2025-05-07T20:32:34.4966537Z ) -> None: 2025-05-07T20:32:34.4966745Z torch.manual_seed(2025) 2025-05-07T20:32:34.4966985Z 2025-05-07T20:32:34.4967244Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4969247Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4971062Z 2025-05-07T20:32:34.4971183Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4971401Z 2025-05-07T20:32:34.4971503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4971910Z self=, 2025-05-07T20:32:34.4972297Z T=4096, 2025-05-07T20:32:34.4972486Z D=7168, 2025-05-07T20:32:34.4972674Z scale_ub=None, 2025-05-07T20:32:34.4972877Z contiguous=True, 2025-05-07T20:32:34.4973246Z compiled=True, 2025-05-07T20:32:34.4973449Z ) 2025-05-07T20:32:34.4973755Z self = 2025-05-07T20:32:34.4974370Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:34.4974636Z 2025-05-07T20:32:34.4974719Z @given( 2025-05-07T20:32:34.4974949Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4975256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4975559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4975879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4976261Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4976540Z ) 2025-05-07T20:32:34.4976883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4977318Z def test_silu_mul_quant( 2025-05-07T20:32:34.4977567Z self, 2025-05-07T20:32:34.4977765Z T: int, 2025-05-07T20:32:34.4986532Z D: int, 2025-05-07T20:32:34.4986792Z scale_ub: Optional[float], 2025-05-07T20:32:34.4987070Z contiguous: bool, 2025-05-07T20:32:34.4987313Z compiled: bool, 2025-05-07T20:32:34.4987527Z ) -> None: 2025-05-07T20:32:34.4987739Z torch.manual_seed(2025) 2025-05-07T20:32:34.4987973Z 2025-05-07T20:32:34.4988247Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.4990265Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.4992096Z 2025-05-07T20:32:34.4992218Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.4992430Z 2025-05-07T20:32:34.4992535Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.4992938Z self=, 2025-05-07T20:32:34.4993321Z T=2048, 2025-05-07T20:32:34.4993503Z D=5120, 2025-05-07T20:32:34.4993686Z scale_ub=1200.0, 2025-05-07T20:32:34.4993907Z contiguous=False, 2025-05-07T20:32:34.4994132Z compiled=False, 2025-05-07T20:32:34.4994350Z ) 2025-05-07T20:32:34.4994660Z self = 2025-05-07T20:32:34.4995144Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.4995415Z 2025-05-07T20:32:34.4995492Z @given( 2025-05-07T20:32:34.4995718Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.4996015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.4996322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.4996645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.4996959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.4997235Z ) 2025-05-07T20:32:34.4997571Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.4998002Z def test_silu_mul_quant( 2025-05-07T20:32:34.4998236Z self, 2025-05-07T20:32:34.4998421Z T: int, 2025-05-07T20:32:34.4998618Z D: int, 2025-05-07T20:32:34.4998831Z scale_ub: Optional[float], 2025-05-07T20:32:34.4999093Z contiguous: bool, 2025-05-07T20:32:34.4999326Z compiled: bool, 2025-05-07T20:32:34.4999537Z ) -> None: 2025-05-07T20:32:34.4999745Z torch.manual_seed(2025) 2025-05-07T20:32:34.4999978Z 2025-05-07T20:32:34.5000234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.5002393Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.5004249Z 2025-05-07T20:32:34.5004363Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.5004572Z 2025-05-07T20:32:34.5004672Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.5005070Z self=, 2025-05-07T20:32:34.5005459Z T=4096, 2025-05-07T20:32:34.5005646Z D=7168, 2025-05-07T20:32:34.5005830Z scale_ub=1200.0, 2025-05-07T20:32:34.5006045Z contiguous=True, 2025-05-07T20:32:34.5006255Z compiled=False, 2025-05-07T20:32:34.5006458Z ) 2025-05-07T20:32:34.6062273Z self = 2025-05-07T20:32:34.6062813Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.6063085Z 2025-05-07T20:32:34.6063164Z @given( 2025-05-07T20:32:34.6063403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6063718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6064023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6064363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6064698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6064987Z ) 2025-05-07T20:32:34.6065362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6065829Z def test_silu_mul_quant( 2025-05-07T20:32:34.6066077Z self, 2025-05-07T20:32:34.6066268Z T: int, 2025-05-07T20:32:34.6066466Z D: int, 2025-05-07T20:32:34.6066694Z scale_ub: Optional[float], 2025-05-07T20:32:34.6066957Z contiguous: bool, 2025-05-07T20:32:34.6067198Z compiled: bool, 2025-05-07T20:32:34.6067420Z ) -> None: 2025-05-07T20:32:34.6067630Z torch.manual_seed(2025) 2025-05-07T20:32:34.6067870Z 2025-05-07T20:32:34.6068145Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6070164Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6071999Z 2025-05-07T20:32:34.6072127Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6072340Z 2025-05-07T20:32:34.6072443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6072855Z self=, 2025-05-07T20:32:34.6073254Z T=16384, 2025-05-07T20:32:34.6073444Z D=7168, 2025-05-07T20:32:34.6073633Z scale_ub=None, 2025-05-07T20:32:34.6073842Z contiguous=False, 2025-05-07T20:32:34.6074071Z compiled=True, 2025-05-07T20:32:34.6074269Z ) 2025-05-07T20:32:34.6074581Z self = 2025-05-07T20:32:34.6075065Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.6075350Z 2025-05-07T20:32:34.6075433Z @given( 2025-05-07T20:32:34.6075704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6076112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6076525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6076856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6077191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6077470Z ) 2025-05-07T20:32:34.6077822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6078262Z def test_silu_mul_quant( 2025-05-07T20:32:34.6078504Z self, 2025-05-07T20:32:34.6078771Z T: int, 2025-05-07T20:32:34.6078970Z D: int, 2025-05-07T20:32:34.6079188Z scale_ub: Optional[float], 2025-05-07T20:32:34.6079453Z contiguous: bool, 2025-05-07T20:32:34.6079690Z compiled: bool, 2025-05-07T20:32:34.6079914Z ) -> None: 2025-05-07T20:32:34.6080124Z torch.manual_seed(2025) 2025-05-07T20:32:34.6080370Z 2025-05-07T20:32:34.6080644Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6082655Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
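Editor's note: the "Tried to allocate" figures match the test's first allocation exactly, a [T, 2*D] bfloat16 tensor at 2 bytes per element. A quick arithmetic check against the sizes seen so far:

```python
# Sanity check: the requested sizes are exactly the footprint of
# x = torch.randn([T, 2 * D], dtype=torch.bfloat16) at 2 bytes per element.
for T, D in [(2048, 5120), (4096, 7168), (16384, 7168)]:
    mib = T * (2 * D) * 2 / 2**20
    print(T, D, f"{mib:.2f} MiB")
# 2048 5120 40.00 MiB
# 4096 7168 112.00 MiB
# 16384 7168 448.00 MiB
```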
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6084481Z 2025-05-07T20:32:34.6084598Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6084811Z 2025-05-07T20:32:34.6084914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6085322Z self=, 2025-05-07T20:32:34.6085717Z T=4096, 2025-05-07T20:32:34.6085904Z D=7168, 2025-05-07T20:32:34.6086099Z scale_ub=None, 2025-05-07T20:32:34.6086317Z contiguous=True, 2025-05-07T20:32:34.6086541Z compiled=False, 2025-05-07T20:32:34.6086748Z ) 2025-05-07T20:32:34.6087066Z self = 2025-05-07T20:32:34.6087546Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.6087816Z 2025-05-07T20:32:34.6087892Z @given( 2025-05-07T20:32:34.6088120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6088428Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6088730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6089053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6089375Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6089651Z ) 2025-05-07T20:32:34.6089993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6090430Z def test_silu_mul_quant( 2025-05-07T20:32:34.6090673Z self, 2025-05-07T20:32:34.6090866Z T: int, 2025-05-07T20:32:34.6091065Z D: int, 2025-05-07T20:32:34.6091279Z scale_ub: Optional[float], 2025-05-07T20:32:34.6091547Z contiguous: bool, 2025-05-07T20:32:34.6091788Z compiled: bool, 2025-05-07T20:32:34.6092003Z ) -> None: 2025-05-07T20:32:34.6092214Z torch.manual_seed(2025) 2025-05-07T20:32:34.6092456Z 2025-05-07T20:32:34.6092720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6094878Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6096750Z 2025-05-07T20:32:34.6096867Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6097084Z 2025-05-07T20:32:34.6097186Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6097591Z self=, 2025-05-07T20:32:34.6097983Z T=16384, 2025-05-07T20:32:34.6098175Z D=7168, 2025-05-07T20:32:34.6098417Z scale_ub=None, 2025-05-07T20:32:34.6098627Z contiguous=True, 2025-05-07T20:32:34.6098850Z compiled=False, 2025-05-07T20:32:34.6099056Z ) 2025-05-07T20:32:34.6099375Z self = 2025-05-07T20:32:34.6099871Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:34.6100148Z 2025-05-07T20:32:34.6100227Z @given( 2025-05-07T20:32:34.6100453Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6100766Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6101069Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6101392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6101715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6101998Z ) 2025-05-07T20:32:34.6102341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6102774Z def test_silu_mul_quant( 2025-05-07T20:32:34.6103024Z self, 2025-05-07T20:32:34.6103219Z T: int, 2025-05-07T20:32:34.6103416Z D: int, 2025-05-07T20:32:34.6103630Z scale_ub: Optional[float], 2025-05-07T20:32:34.6103903Z contiguous: bool, 2025-05-07T20:32:34.6104142Z compiled: bool, 2025-05-07T20:32:34.6104361Z ) -> None: 2025-05-07T20:32:34.6104575Z torch.manual_seed(2025) 2025-05-07T20:32:34.6104819Z 2025-05-07T20:32:34.6105084Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6107098Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6108924Z 2025-05-07T20:32:34.6109043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6109252Z 2025-05-07T20:32:34.6109359Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6109768Z self=, 2025-05-07T20:32:34.6110164Z T=16384, 2025-05-07T20:32:34.6110362Z D=7168, 2025-05-07T20:32:34.6110550Z scale_ub=1200.0, 2025-05-07T20:32:34.6110772Z contiguous=True, 2025-05-07T20:32:34.6110993Z compiled=False, 2025-05-07T20:32:34.6111195Z ) 2025-05-07T20:32:34.6111528Z self = 2025-05-07T20:32:34.6112016Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.6112298Z 2025-05-07T20:32:34.6112378Z @given( 2025-05-07T20:32:34.6112621Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.6112928Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.6113229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.6113554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.6113878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.6114165Z ) 2025-05-07T20:32:34.6114563Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.6114995Z def test_silu_mul_quant( 2025-05-07T20:32:34.6115334Z self, 2025-05-07T20:32:34.6115529Z T: int, 2025-05-07T20:32:34.6115724Z D: int, 2025-05-07T20:32:34.6115939Z scale_ub: Optional[float], 2025-05-07T20:32:34.6116214Z contiguous: bool, 2025-05-07T20:32:34.6116451Z compiled: bool, 2025-05-07T20:32:34.6116671Z ) -> None: 2025-05-07T20:32:34.6116883Z torch.manual_seed(2025) 2025-05-07T20:32:34.6117120Z 2025-05-07T20:32:34.6117427Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.6119432Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
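Editor's note: with only ~26 MiB free of 22.07 GiB, even small examples fail once earlier ones have filled the device. One hedged mitigation, not part of activation_test.py as shipped, is to release cached CUDA blocks between Hypothesis examples:

```python
# Hypothetical tearDown for the test class: drop Python references from the
# previous example, then return cached blocks to the driver so the next
# example starts from a cleaner allocator state. This helps only if the
# earlier allocations are actually unreferenced by then.
import gc
import unittest

import torch


class ActivationTests(unittest.TestCase):
    def tearDown(self) -> None:
        gc.collect()
        torch.cuda.empty_cache()
```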
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.6121254Z 2025-05-07T20:32:34.6121371Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.6121579Z 2025-05-07T20:32:34.6121684Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.6122088Z self=, 2025-05-07T20:32:34.6122490Z T=128, 2025-05-07T20:32:34.6122685Z D=5120, 2025-05-07T20:32:34.6122883Z scale_ub=1200.0, 2025-05-07T20:32:34.6123109Z contiguous=False, 2025-05-07T20:32:34.6123335Z compiled=False, 2025-05-07T20:32:34.6123550Z ) 2025-05-07T20:32:34.7408137Z self = 2025-05-07T20:32:34.7409162Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:34.7409729Z 2025-05-07T20:32:34.7409889Z @given( 2025-05-07T20:32:34.7410353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7410957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7411557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7412205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7412839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7413521Z ) 2025-05-07T20:32:34.7414202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7415066Z def test_silu_mul_quant( 2025-05-07T20:32:34.7415351Z self, 2025-05-07T20:32:34.7415581Z T: int, 2025-05-07T20:32:34.7415792Z D: int, 2025-05-07T20:32:34.7416012Z scale_ub: Optional[float], 2025-05-07T20:32:34.7416284Z contiguous: bool, 2025-05-07T20:32:34.7416524Z compiled: bool, 2025-05-07T20:32:34.7416745Z ) -> None: 2025-05-07T20:32:34.7416960Z torch.manual_seed(2025) 2025-05-07T20:32:34.7417201Z 2025-05-07T20:32:34.7417472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7417816Z 2025-05-07T20:32:34.7418015Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7418299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7418615Z x = x_sign * x_clamp 2025-05-07T20:32:34.7418866Z x0 = x[:, :D] 2025-05-07T20:32:34.7419079Z x1 = x[:, D:] 2025-05-07T20:32:34.7419298Z 2025-05-07T20:32:34.7419495Z if contiguous: 2025-05-07T20:32:34.7419728Z x0 = x0.contiguous() 2025-05-07T20:32:34.7419986Z x1 = x1.contiguous() 2025-05-07T20:32:34.7420230Z 2025-05-07T20:32:34.7420430Z if scale_ub is not None: 2025-05-07T20:32:34.7420695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7421033Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7421445Z ) 2025-05-07T20:32:34.7421635Z else: 2025-05-07T20:32:34.7421847Z scale_ub_tensor = None 2025-05-07T20:32:34.7422208Z 2025-05-07T20:32:34.7422445Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7422761Z op = silu_mul_quant 2025-05-07T20:32:34.7423010Z if compiled: 2025-05-07T20:32:34.7423256Z op = torch.compile(op) 2025-05-07T20:32:34.7423553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7423830Z 2025-05-07T20:32:34.7424081Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7424248Z 2025-05-07T20:32:34.7424348Z moe/activation_test.py:117: 2025-05-07T20:32:34.7424648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7424982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7425260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7426101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7426932Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7427553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7428356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7429068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7429600Z kernel = self.compile( 2025-05-07T20:32:34.7430143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7430790Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7431186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7431414Z 2025-05-07T20:32:34.7431632Z self = 2025-05-07T20:32:34.7432704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7434057Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016850cc0>} 2025-05-07T20:32:34.7435410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7436457Z context = 2025-05-07T20:32:34.7436744Z 2025-05-07T20:32:34.7436916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7437443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7437915Z module_map=module_map) 2025-05-07T20:32:34.7438279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7438629Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7438896Z E ^ 2025-05-07T20:32:34.7439360Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7439804Z 2025-05-07T20:32:34.7440221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7440726Z 2025-05-07T20:32:34.7440832Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7441244Z self=, 2025-05-07T20:32:34.7441651Z T=2048, 2025-05-07T20:32:34.7441840Z D=7168, 2025-05-07T20:32:34.7442082Z scale_ub=None, 2025-05-07T20:32:34.7442297Z contiguous=False, 2025-05-07T20:32:34.7442523Z compiled=False, 2025-05-07T20:32:34.7442807Z ) 2025-05-07T20:32:34.7443127Z self = 2025-05-07T20:32:34.7443616Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:34.7443889Z 2025-05-07T20:32:34.7443967Z @given( 2025-05-07T20:32:34.7444203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7444521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7444869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7445200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7445566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7445876Z ) 2025-05-07T20:32:34.7446230Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7446681Z def test_silu_mul_quant( 2025-05-07T20:32:34.7446929Z self, 2025-05-07T20:32:34.7447132Z T: int, 2025-05-07T20:32:34.7447340Z D: int, 2025-05-07T20:32:34.7447568Z scale_ub: Optional[float], 2025-05-07T20:32:34.7447848Z contiguous: bool, 2025-05-07T20:32:34.7448094Z compiled: bool, 2025-05-07T20:32:34.7448319Z ) -> None: 2025-05-07T20:32:34.7448539Z torch.manual_seed(2025) 2025-05-07T20:32:34.7448788Z 2025-05-07T20:32:34.7449068Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7451098Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
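Editor's note: the CompilationError above is an architecture limit rather than a bug in the kernel call: the error text itself lists ('fp8e4b15', 'fp8e5') as the only fp8 dtypes this GPU supports, and Triton's fp8e4nv (E4M3) path generally requires a newer compute capability. A sketch of a skip guard, where the (8, 9) threshold (Ada/Hopper) is an assumption inferred from the error, not a constant taken from FBGEMM or Triton:

```python
# Hypothetical guard: skip fp8e4nv tests on GPUs that cannot compile them.
import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # (8, 9) is an assumed minimum compute capability for fp8e4nv.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):
    ...
```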
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7452937Z 2025-05-07T20:32:34.7453123Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:34.7453337Z 2025-05-07T20:32:34.7453440Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7453853Z self=, 2025-05-07T20:32:34.7454250Z T=128, 2025-05-07T20:32:34.7454437Z D=7168, 2025-05-07T20:32:34.7454630Z scale_ub=1200.0, 2025-05-07T20:32:34.7454858Z contiguous=True, 2025-05-07T20:32:34.7455076Z compiled=True, 2025-05-07T20:32:34.7455284Z ) 2025-05-07T20:32:34.7766812Z self = 2025-05-07T20:32:34.7767341Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.7767604Z 2025-05-07T20:32:34.7767685Z @given( 2025-05-07T20:32:34.7767924Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7768243Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7768551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7768886Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7769219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7769509Z ) 2025-05-07T20:32:34.7769856Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7770306Z def test_silu_mul_quant( 2025-05-07T20:32:34.7770556Z self, 2025-05-07T20:32:34.7770753Z T: int, 2025-05-07T20:32:34.7770957Z D: int, 2025-05-07T20:32:34.7771181Z scale_ub: Optional[float], 2025-05-07T20:32:34.7771450Z contiguous: bool, 2025-05-07T20:32:34.7771692Z compiled: bool, 2025-05-07T20:32:34.7771922Z ) -> None: 2025-05-07T20:32:34.7772136Z torch.manual_seed(2025) 2025-05-07T20:32:34.7772487Z 2025-05-07T20:32:34.7772765Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7773176Z 2025-05-07T20:32:34.7773492Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7773789Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7774092Z x = x_sign * x_clamp 2025-05-07T20:32:34.7774334Z x0 = x[:, :D] 2025-05-07T20:32:34.7774558Z x1 = x[:, D:] 2025-05-07T20:32:34.7774771Z 2025-05-07T20:32:34.7774959Z if contiguous: 2025-05-07T20:32:34.7775203Z x0 = x0.contiguous() 2025-05-07T20:32:34.7775525Z x1 = x1.contiguous() 2025-05-07T20:32:34.7775763Z 2025-05-07T20:32:34.7775960Z if scale_ub is not None: 2025-05-07T20:32:34.7776234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.7776564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.7776880Z ) 2025-05-07T20:32:34.7777078Z else: 2025-05-07T20:32:34.7777292Z scale_ub_tensor = None 2025-05-07T20:32:34.7777549Z 2025-05-07T20:32:34.7777785Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.7778105Z op = silu_mul_quant 2025-05-07T20:32:34.7778360Z if compiled: 2025-05-07T20:32:34.7778614Z op = torch.compile(op) 2025-05-07T20:32:34.7778909Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7779190Z 2025-05-07T20:32:34.7779391Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.7779556Z 2025-05-07T20:32:34.7779660Z moe/activation_test.py:117: 2025-05-07T20:32:34.7788775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7789126Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.7789404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.7789956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.7790511Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.7791168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.7791840Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.7792369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.7793030Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.7793682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.7794201Z kernel = self.compile( 2025-05-07T20:32:34.7794734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.7795371Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.7795760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.7795986Z 2025-05-07T20:32:34.7796191Z self = 2025-05-07T20:32:34.7797256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.7798602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016851a80>} 2025-05-07T20:32:34.7799914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.7800909Z context = 2025-05-07T20:32:34.7801196Z 2025-05-07T20:32:34.7801454Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.7802037Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.7802496Z module_map=module_map) 2025-05-07T20:32:34.7802853Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.7803197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.7803448Z E ^ 2025-05-07T20:32:34.7803903Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.7804389Z 2025-05-07T20:32:34.7804798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.7805304Z 2025-05-07T20:32:34.7805431Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7805869Z self=, 2025-05-07T20:32:34.7806269Z T=128, 2025-05-07T20:32:34.7806457Z D=7168, 2025-05-07T20:32:34.7806647Z scale_ub=1200.0, 2025-05-07T20:32:34.7806860Z contiguous=True, 2025-05-07T20:32:34.7807082Z compiled=False, 2025-05-07T20:32:34.7807286Z ) 2025-05-07T20:32:34.7807592Z self = 2025-05-07T20:32:34.7808073Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:34.7808339Z 2025-05-07T20:32:34.7808413Z @given( 2025-05-07T20:32:34.7808636Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7808940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7809237Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7809556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7809872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7810152Z ) 2025-05-07T20:32:34.7810493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7810920Z def test_silu_mul_quant( 2025-05-07T20:32:34.7811154Z self, 2025-05-07T20:32:34.7811350Z T: int, 2025-05-07T20:32:34.7811559Z D: int, 2025-05-07T20:32:34.7811771Z scale_ub: Optional[float], 2025-05-07T20:32:34.7812036Z contiguous: bool, 2025-05-07T20:32:34.7812272Z compiled: bool, 2025-05-07T20:32:34.7812489Z ) -> None: 2025-05-07T20:32:34.7812697Z torch.manual_seed(2025) 2025-05-07T20:32:34.7812930Z 2025-05-07T20:32:34.7813246Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7813582Z 2025-05-07T20:32:34.7813773Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7814049Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7816082Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7817897Z 2025-05-07T20:32:34.7818012Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.7818222Z 2025-05-07T20:32:34.7818320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7818723Z self=, 2025-05-07T20:32:34.7819103Z T=128, 2025-05-07T20:32:34.7819288Z D=5120, 2025-05-07T20:32:34.7819477Z scale_ub=1200.0, 2025-05-07T20:32:34.7819689Z contiguous=True, 2025-05-07T20:32:34.7819903Z compiled=True, 2025-05-07T20:32:34.7820099Z ) 2025-05-07T20:32:34.7820409Z self = 2025-05-07T20:32:34.7820928Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:34.7821266Z 2025-05-07T20:32:34.7821344Z @given( 2025-05-07T20:32:34.7821566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.7821863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.7822160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.7822479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.7822794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.7823114Z ) 2025-05-07T20:32:34.7823451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.7823878Z def test_silu_mul_quant( 2025-05-07T20:32:34.7824106Z self, 2025-05-07T20:32:34.7824297Z T: int, 2025-05-07T20:32:34.7824483Z D: int, 2025-05-07T20:32:34.7824689Z scale_ub: Optional[float], 2025-05-07T20:32:34.7824954Z contiguous: bool, 2025-05-07T20:32:34.7825188Z compiled: bool, 2025-05-07T20:32:34.7825401Z ) -> None: 2025-05-07T20:32:34.7825616Z torch.manual_seed(2025) 2025-05-07T20:32:34.7825853Z 2025-05-07T20:32:34.7826113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.7826445Z 2025-05-07T20:32:34.7826633Z x_sign = torch.sign(x) 2025-05-07T20:32:34.7826914Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.7828859Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
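Editor's note: for what the test is actually computing, the ref_fn shown later in this log is SiLU(x0) * x1 in fp32 followed by row-wise FP8 quantization. A self-contained version of the unquantized part, mirroring the log's own ref_fn minus the triton_quantize_fp8_row step (the step that trips the fp8e4nv compile error on this runner):

```python
# Unquantized reference for the op under test: SiLU(x0) * x1 in fp32.
import torch


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
```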
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:34.7830676Z 2025-05-07T20:32:34.7830797Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:34.7831007Z 2025-05-07T20:32:34.7831107Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.7831507Z self=, 2025-05-07T20:32:34.7831891Z T=128, 2025-05-07T20:32:34.7832070Z D=7168, 2025-05-07T20:32:34.7832253Z scale_ub=None, 2025-05-07T20:32:34.7832455Z contiguous=True, 2025-05-07T20:32:34.7832674Z compiled=True, 2025-05-07T20:32:34.7832871Z ) 2025-05-07T20:32:35.2583968Z self = 2025-05-07T20:32:35.2584480Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.2584749Z 2025-05-07T20:32:35.2584828Z @given( 2025-05-07T20:32:35.2585060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2585388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2585727Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2586054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2586368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2586650Z ) 2025-05-07T20:32:35.2586994Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2587426Z def test_silu_mul_quant( 2025-05-07T20:32:35.2587657Z self, 2025-05-07T20:32:35.2587852Z T: int, 2025-05-07T20:32:35.2588050Z D: int, 2025-05-07T20:32:35.2588260Z scale_ub: Optional[float], 2025-05-07T20:32:35.2588526Z contiguous: bool, 2025-05-07T20:32:35.2588761Z compiled: bool, 2025-05-07T20:32:35.2588982Z ) -> None: 2025-05-07T20:32:35.2589194Z torch.manual_seed(2025) 2025-05-07T20:32:35.2589433Z 2025-05-07T20:32:35.2589698Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2591993Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2593872Z 2025-05-07T20:32:35.2593992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.2594204Z 2025-05-07T20:32:35.2629659Z FAILED 2025-05-07T20:32:35.2630034Z 2025-05-07T20:32:35.2630496Z =================================== FAILURES =================================== 2025-05-07T20:32:35.2631152Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:35.2631779Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:35.2632649Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:35.2633400Z | yield 2025-05-07T20:32:35.2633993Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:35.2634689Z | self._callTestMethod(testMethod) 2025-05-07T20:32:35.2635457Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:35.2636224Z | if method() is not None: 2025-05-07T20:32:35.2636577Z | ^^^^^^^^ 2025-05-07T20:32:35.2637433Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:35.2638432Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2638850Z | ^^^^^^^ 2025-05-07T20:32:35.2639617Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:35.2640477Z | raise the_error_hypothesis_found 2025-05-07T20:32:35.2641064Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:35.2641645Z +-+---------------- 1 ---------------- 2025-05-07T20:32:35.2642034Z | Traceback (most recent call last): 2025-05-07T20:32:35.2643019Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.2644081Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2644586Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2647355Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2650122Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2650715Z | self=, 2025-05-07T20:32:35.2651280Z | T=2048, 2025-05-07T20:32:35.2651604Z | D=5120, # or any other generated value 2025-05-07T20:32:35.2652058Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.2652564Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.2653327Z | compiled=False, # or any other generated value 2025-05-07T20:32:35.2653740Z | ) 2025-05-07T20:32:35.2653991Z | 2025-05-07T20:32:35.2654914Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:35.2655754Z +---------------- 2 ---------------- 2025-05-07T20:32:35.2656149Z | Traceback (most recent call last): 2025-05-07T20:32:35.2657123Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.2658258Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2658780Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2661717Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2664434Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2665033Z | self=, 2025-05-07T20:32:35.2665630Z | T=128, 2025-05-07T20:32:35.2665940Z | D=7168, 2025-05-07T20:32:35.2666239Z | scale_ub=None, 2025-05-07T20:32:35.2666496Z | contiguous=True, 2025-05-07T20:32:35.2666766Z | compiled=True, 2025-05-07T20:32:35.2667004Z | ) 2025-05-07T20:32:35.2667204Z | 2025-05-07T20:32:35.2667739Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.2668345Z +---------------- 3 ---------------- 2025-05-07T20:32:35.2668648Z | Traceback (most recent call last): 2025-05-07T20:32:35.2670070Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:35.2670858Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2671235Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2673423Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
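Editor's note: each sub-failure above comes with a @reproduce_failure hint, and adding it temporarily makes Hypothesis replay exactly that falsifying example. A sketch for failure 1, shown standalone (the real test is a method taking self, and its body is elided here):

```python
# Replaying failure 1 locally: the version string and payload are copied
# verbatim from the hint above. The decorator is meant to be temporary;
# remove it once the bug is fixed.
from hypothesis import given, reproduce_failure, settings, strategies as st


@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@settings(deadline=None)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
    ...  # body elided; Hypothesis pins all draws to the failing example
```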
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.2675512Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2675969Z | self=, 2025-05-07T20:32:35.2676388Z | T=128, 2025-05-07T20:32:35.2676600Z | D=5120, 2025-05-07T20:32:35.2676827Z | scale_ub=1200.0, 2025-05-07T20:32:35.2677085Z | contiguous=True, 2025-05-07T20:32:35.2677339Z | compiled=True, 2025-05-07T20:32:35.2677586Z | ) 2025-05-07T20:32:35.2677784Z | 2025-05-07T20:32:35.2678312Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.2678925Z +---------------- 4 ---------------- 2025-05-07T20:32:35.2679228Z | Traceback (most recent call last): 2025-05-07T20:32:35.2680191Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:35.2680918Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.2681219Z | ^^^^^^^^ 2025-05-07T20:32:35.2681867Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:35.2682559Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2682984Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2683787Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:35.2684588Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.2685204Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:35.2685949Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2686528Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2687385Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:35.2688437Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.2689080Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2689949Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:35.2690899Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.2691424Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2692259Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:35.2693127Z | fn() 2025-05-07T20:32:35.2693914Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:35.2694785Z | self.fn.run( 2025-05-07T20:32:35.2695524Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:35.2696322Z | kernel = self.compile( 2025-05-07T20:32:35.2696691Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:35.2697512Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:35.2698493Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2699024Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2699928Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.2701011Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2701673Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.2834115Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2834643Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.2835013Z | ^ 2025-05-07T20:32:35.2835651Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2836424Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:35.2837319Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:35.2838212Z | self=, 2025-05-07T20:32:35.2838800Z | T=1, # or any other generated value 2025-05-07T20:32:35.2839232Z | D=5120, # or any other generated value 2025-05-07T20:32:35.2839695Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:35.2840185Z | contiguous=True, # or any other generated value 2025-05-07T20:32:35.2840684Z | compiled=True, # or any other generated value 2025-05-07T20:32:35.2841227Z | ) 2025-05-07T20:32:35.2841421Z | 2025-05-07T20:32:35.2841955Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:35.2842560Z +------------------------------------ 2025-05-07T20:32:35.2842928Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:35.2843297Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2843715Z self=, 2025-05-07T20:32:35.2844121Z T=1, 2025-05-07T20:32:35.2844299Z D=5120, 2025-05-07T20:32:35.2844488Z scale_ub=None, 2025-05-07T20:32:35.2844704Z contiguous=True, 2025-05-07T20:32:35.2844918Z compiled=True, 2025-05-07T20:32:35.2845128Z ) 2025-05-07T20:32:35.2845452Z self = 2025-05-07T20:32:35.2845969Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.2846240Z 2025-05-07T20:32:35.2846316Z @given( 2025-05-07T20:32:35.2846545Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2846856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2847153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2847477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2847806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2848086Z ) 2025-05-07T20:32:35.2848436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2848877Z def test_silu_mul_quant( 2025-05-07T20:32:35.2849113Z self, 2025-05-07T20:32:35.2849308Z T: int, 2025-05-07T20:32:35.2849509Z D: int, 2025-05-07T20:32:35.2849726Z scale_ub: Optional[float], 2025-05-07T20:32:35.2849997Z contiguous: bool, 2025-05-07T20:32:35.2850242Z compiled: bool, 2025-05-07T20:32:35.2850469Z ) -> None: 2025-05-07T20:32:35.2850677Z torch.manual_seed(2025) 2025-05-07T20:32:35.2850918Z 2025-05-07T20:32:35.2851194Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2851529Z 2025-05-07T20:32:35.2851725Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2852015Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2852320Z x = x_sign * x_clamp 2025-05-07T20:32:35.2852563Z x0 = x[:, :D] 2025-05-07T20:32:35.2852780Z x1 = x[:, D:] 2025-05-07T20:32:35.2853094Z 2025-05-07T20:32:35.2853299Z if contiguous: 2025-05-07T20:32:35.2853529Z x0 = x0.contiguous() 2025-05-07T20:32:35.2853781Z x1 = x1.contiguous() 2025-05-07T20:32:35.2854021Z 2025-05-07T20:32:35.2854212Z if scale_ub is not None: 2025-05-07T20:32:35.2854477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2854814Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2855124Z ) 2025-05-07T20:32:35.2855310Z else: 2025-05-07T20:32:35.2855535Z scale_ub_tensor = None 2025-05-07T20:32:35.2855816Z 2025-05-07T20:32:35.2856046Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2856351Z op = silu_mul_quant 2025-05-07T20:32:35.2856606Z if compiled: 2025-05-07T20:32:35.2856914Z op = torch.compile(op) 2025-05-07T20:32:35.2857199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2857472Z 2025-05-07T20:32:35.2857747Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.2858030Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.2858318Z 2025-05-07T20:32:35.2858556Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2858881Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.2859172Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.2859876Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.2860233Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2860535Z 2025-05-07T20:32:35.2860738Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.2860931Z 2025-05-07T20:32:35.2861036Z moe/activation_test.py:126: 2025-05-07T20:32:35.2861325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2878768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.2879122Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2879897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.2880642Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.2881182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2881868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2882543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.2883255Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.2883970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.2884609Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.2885200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.2885754Z fn() 2025-05-07T20:32:35.2886252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.2886819Z self.fn.run( 2025-05-07T20:32:35.2887282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2887807Z kernel = self.compile( 2025-05-07T20:32:35.2888333Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2888982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2889368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2889598Z 2025-05-07T20:32:35.2889812Z self = 2025-05-07T20:32:35.2890866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2892230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1098d4c20>} 2025-05-07T20:32:35.2893641Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2894650Z context = 2025-05-07T20:32:35.2895064Z 2025-05-07T20:32:35.2895229Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2895860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2896323Z module_map=module_map) 2025-05-07T20:32:35.2896684Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2897033Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.2897299Z E ^ 2025-05-07T20:32:35.2897749Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2898250Z 2025-05-07T20:32:35.2898659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2899156Z 2025-05-07T20:32:35.2899258Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2899662Z self=, 2025-05-07T20:32:35.2900055Z T=2048, 2025-05-07T20:32:35.2900243Z D=5120, 2025-05-07T20:32:35.2900430Z scale_ub=1200.0, 2025-05-07T20:32:35.2900647Z contiguous=True, 2025-05-07T20:32:35.2900864Z compiled=False, 2025-05-07T20:32:35.2901062Z ) 2025-05-07T20:32:35.2901377Z self = 2025-05-07T20:32:35.2901861Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.2902124Z 2025-05-07T20:32:35.2902199Z @given( 2025-05-07T20:32:35.2902425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2902735Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2903031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2903359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2903685Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2903965Z ) 2025-05-07T20:32:35.2904315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2904757Z def test_silu_mul_quant( 2025-05-07T20:32:35.2905006Z self, 2025-05-07T20:32:35.2905198Z T: int, 2025-05-07T20:32:35.2905405Z D: int, 2025-05-07T20:32:35.2905667Z scale_ub: Optional[float], 2025-05-07T20:32:35.2905942Z contiguous: bool, 2025-05-07T20:32:35.2906192Z compiled: bool, 2025-05-07T20:32:35.2906419Z ) -> None: 2025-05-07T20:32:35.2906620Z torch.manual_seed(2025) 2025-05-07T20:32:35.2906853Z 2025-05-07T20:32:35.2907120Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2907458Z 2025-05-07T20:32:35.2907651Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2907942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2908248Z x = x_sign * x_clamp 2025-05-07T20:32:35.2908488Z x0 = x[:, :D] 
2025-05-07T20:32:35.2908704Z x1 = x[:, D:] 2025-05-07T20:32:35.2908914Z 2025-05-07T20:32:35.2909104Z if contiguous: 2025-05-07T20:32:35.2909340Z x0 = x0.contiguous() 2025-05-07T20:32:35.2909600Z x1 = x1.contiguous() 2025-05-07T20:32:35.2909835Z 2025-05-07T20:32:35.2910028Z if scale_ub is not None: 2025-05-07T20:32:35.2910298Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2910626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2910935Z ) 2025-05-07T20:32:35.2911128Z else: 2025-05-07T20:32:35.2911333Z scale_ub_tensor = None 2025-05-07T20:32:35.2911588Z 2025-05-07T20:32:35.2911820Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2912128Z op = silu_mul_quant 2025-05-07T20:32:35.2912378Z if compiled: 2025-05-07T20:32:35.2912628Z op = torch.compile(op) 2025-05-07T20:32:35.2912913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2913241Z 2025-05-07T20:32:35.2913438Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2913599Z 2025-05-07T20:32:35.2913710Z moe/activation_test.py:117: 2025-05-07T20:32:35.2914074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2914402Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2914681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2915356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2916082Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2916611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2917282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2917931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2918462Z kernel = self.compile( 2025-05-07T20:32:35.2919004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2919647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2920043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2920279Z 2025-05-07T20:32:35.2920484Z self = 2025-05-07T20:32:35.2921549Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2922897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb109990180>} 2025-05-07T20:32:35.2924221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2925238Z context = 2025-05-07T20:32:35.2925521Z 2025-05-07T20:32:35.2925694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2926215Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2926676Z module_map=module_map) 2025-05-07T20:32:35.2927033Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2927382Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2927632Z E ^ 2025-05-07T20:32:35.2928086Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2928530Z 2025-05-07T20:32:35.2928947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2929445Z 2025-05-07T20:32:35.2929554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2929954Z self=, 2025-05-07T20:32:35.2930346Z T=2048, 2025-05-07T20:32:35.2930529Z D=5120, 2025-05-07T20:32:35.2930714Z scale_ub=1200.0, 2025-05-07T20:32:35.2930933Z contiguous=True, 2025-05-07T20:32:35.2931150Z compiled=True, 2025-05-07T20:32:35.2931356Z ) 2025-05-07T20:32:35.2931671Z self = 2025-05-07T20:32:35.2932156Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2932418Z 2025-05-07T20:32:35.2932501Z @given( 2025-05-07T20:32:35.2932724Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2933180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2933483Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2933880Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2934201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2934478Z ) 2025-05-07T20:32:35.2934813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2935245Z def test_silu_mul_quant( 2025-05-07T20:32:35.2935488Z self, 2025-05-07T20:32:35.2935705Z T: int, 2025-05-07T20:32:35.2935949Z D: int, 2025-05-07T20:32:35.2936162Z scale_ub: Optional[float], 2025-05-07T20:32:35.2936421Z contiguous: bool, 2025-05-07T20:32:35.2936653Z compiled: bool, 2025-05-07T20:32:35.2936871Z ) -> None: 2025-05-07T20:32:35.2937077Z torch.manual_seed(2025) 2025-05-07T20:32:35.2937317Z 2025-05-07T20:32:35.2937579Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2937911Z 2025-05-07T20:32:35.2938098Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2938387Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2938686Z x = x_sign * x_clamp 2025-05-07T20:32:35.2938916Z x0 = x[:, :D] 2025-05-07T20:32:35.2939127Z x1 = x[:, D:] 2025-05-07T20:32:35.2939326Z 2025-05-07T20:32:35.2939501Z if contiguous: 2025-05-07T20:32:35.2939724Z x0 = x0.contiguous() 2025-05-07T20:32:35.2939969Z x1 = x1.contiguous() 2025-05-07T20:32:35.2940198Z 2025-05-07T20:32:35.2940392Z if scale_ub is not None: 2025-05-07T20:32:35.2940655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2940974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2941273Z ) 2025-05-07T20:32:35.2941462Z else: 2025-05-07T20:32:35.2941661Z scale_ub_tensor = None 2025-05-07T20:32:35.2941906Z 2025-05-07T20:32:35.2942134Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2942434Z op = silu_mul_quant 2025-05-07T20:32:35.2942673Z if compiled: 2025-05-07T20:32:35.2942911Z op = torch.compile(op) 2025-05-07T20:32:35.2943193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2943456Z 2025-05-07T20:32:35.2943641Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.2943918Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.2944196Z 2025-05-07T20:32:35.2944427Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2944756Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.2945036Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.2945340Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.2945713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2946037Z 2025-05-07T20:32:35.2946228Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.2946422Z 2025-05-07T20:32:35.2946519Z moe/activation_test.py:126: 2025-05-07T20:32:35.2946815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2947136Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.2947455Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.2948219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.2948951Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.2949485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2950144Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2950808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.2951556Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.2952362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.2952984Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.2953568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.2954070Z fn() 2025-05-07T20:32:35.2954575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.2955185Z self.fn.run( 2025-05-07T20:32:35.2955643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2956166Z kernel = self.compile( 2025-05-07T20:32:35.2956697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2957342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2957733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2957963Z 2025-05-07T20:32:35.2958166Z self = 2025-05-07T20:32:35.2959428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2960812Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852d260>} 2025-05-07T20:32:35.2962122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2963137Z context = 2025-05-07T20:32:35.2963425Z 2025-05-07T20:32:35.2963588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2964098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2964551Z module_map=module_map) 2025-05-07T20:32:35.2964912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2965271Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.2965559Z E ^ 2025-05-07T20:32:35.2966032Z E ValueError("type fp8e4nv not supported in this architecture. 
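The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Every failure in this run has the same root cause: both kernels ask Triton for the fp8e4nv element type (what torch.float8_e4m3fn lowers to), and Triton's NVIDIA backend accepts fp8e4nv only on GPUs with compute capability >= 8.9 (Ada/Hopper). On older parts it reports exactly the pair seen here, ('fp8e4b15', 'fp8e5'), so the GPU this job ran on is evidently below sm_89, and both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row fail at compile time, before any launch. A minimal probe for this precondition (a sketch; the >= (8, 9) threshold is inferred from the error message above, not stated anywhere in this log):

```python
import torch

def fp8e4nv_supported() -> bool:
    # Triton emits fp8e4nv (float8_e4m3fn) only for NVIDIA GPUs with
    # compute capability >= 8.9; below that only fp8e4b15/fp8e5 exist,
    # which is exactly the ValueError this log keeps hitting.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)
```

On a machine where this returns False, every Hypothesis example below is bound to raise the same CompilationError, which is the pattern the rest of this log shows.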
The remaining Hypothesis examples all hit the identical CompilationError; only the drawn parameters (and hex addresses in the reprs) differ, and the test body shown above is repeated verbatim each time. When compiled=False, the error is raised at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant (activation.py:80); when compiled=True, fn() evidently gets through under torch.compile and the error moves to y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) while compiling _kernel_quantize_fp8_row inside triton_quantize_fp8_row (fp8_gemm.py:2370).

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
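Rather than letting Hypothesis walk every drawn example into the same compile failure, a test with this hardware requirement is normally gated up front. A hypothetical guard (the marker name and its placement are illustrative assumptions, not FBGEMM's actual skip logic):

```python
import pytest
import torch

# Hypothetical marker; FBGEMM may gate these tests differently.
requires_fp8e4nv = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (8, 9),
    reason="Triton fp8e4nv needs compute capability >= 8.9; "
    "this GPU only offers fp8e4b15/fp8e5",
)

@requires_fp8e4nv
def test_silu_mul_quant_fp8() -> None:
    ...  # same body as test_silu_mul_quant above
```

With a guard like this, the repeated "Trying example" blocks below would collapse into a single SKIPPED result instead of one CompilationError per drawn example.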
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fails in ref_fn() / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> fails in fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fails in ref_fn() / _kernel_quantize_fp8_row
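For reference, the computation both failing paths implement is small enough to write in eager PyTorch: y = silu(x0) * x1 in fp32, then per-row quantization to float8_e4m3fn under the dequant convention the test checks (y_fp8.to(torch.float32) * y_scale[:, None]). The sketch below mirrors the test's ref_fn; the exact clamping and epsilon rules inside triton_quantize_fp8_row are assumptions:

```python
from typing import Optional, Tuple

import torch

def silu_mul_quant_eager(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Row-wise symmetric scale: row_max / fp8_max, optionally capped by
    # scale_ub (assumed semantics; triton_quantize_fp8_row may differ).
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    # Quantize so that y ~= y_fp8.to(torch.float32) * scale[:, None].
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale
```

The plain cast to torch.float8_e4m3fn does not go through Triton codegen, so an eager fallback like this should run even where the two kernels above cannot compile; only Triton's fp8e4nv lowering has the sm_89 floor.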
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3157733Z 2025-05-07T20:32:35.3158139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3158147Z 2025-05-07T20:32:35.3158251Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3158468Z self=, 2025-05-07T20:32:35.3158550Z T=2048, 2025-05-07T20:32:35.3158624Z D=5120, 2025-05-07T20:32:35.3158702Z scale_ub=None, 2025-05-07T20:32:35.3158796Z contiguous=True, 2025-05-07T20:32:35.3158876Z compiled=True, 2025-05-07T20:32:35.3158949Z ) 2025-05-07T20:32:35.3159165Z self = 2025-05-07T20:32:35.3159557Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.3159565Z 2025-05-07T20:32:35.3159672Z @given( 2025-05-07T20:32:35.3159793Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3159891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3160010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3160122Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3160233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3160308Z ) 2025-05-07T20:32:35.3160547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3160639Z def test_silu_mul_quant( 2025-05-07T20:32:35.3160718Z self, 2025-05-07T20:32:35.3160795Z T: int, 2025-05-07T20:32:35.3160868Z D: int, 2025-05-07T20:32:35.3160974Z scale_ub: Optional[float], 2025-05-07T20:32:35.3161061Z contiguous: bool, 2025-05-07T20:32:35.3161146Z compiled: bool, 2025-05-07T20:32:35.3161222Z ) -> None: 2025-05-07T20:32:35.3161312Z torch.manual_seed(2025) 2025-05-07T20:32:35.3161385Z 2025-05-07T20:32:35.3161632Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3161704Z 2025-05-07T20:32:35.3161925Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3162050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3162137Z x = x_sign * x_clamp 2025-05-07T20:32:35.3162218Z x0 = x[:, :D] 2025-05-07T20:32:35.3162296Z x1 = x[:, D:] 2025-05-07T20:32:35.3162363Z 2025-05-07T20:32:35.3162445Z if contiguous: 2025-05-07T20:32:35.3162535Z x0 = x0.contiguous() 2025-05-07T20:32:35.3162627Z x1 = x1.contiguous() 2025-05-07T20:32:35.3162759Z 2025-05-07T20:32:35.3162846Z if scale_ub is not None: 2025-05-07T20:32:35.3162953Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3163083Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3163157Z ) 2025-05-07T20:32:35.3163233Z else: 2025-05-07T20:32:35.3163325Z scale_ub_tensor = None 2025-05-07T20:32:35.3163398Z 2025-05-07T20:32:35.3163531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3163623Z op = silu_mul_quant 2025-05-07T20:32:35.3163705Z if compiled: 2025-05-07T20:32:35.3163805Z op = torch.compile(op) 2025-05-07T20:32:35.3163904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3163971Z 2025-05-07T20:32:35.3164062Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.3164177Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.3164254Z 2025-05-07T20:32:35.3164385Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3164483Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.3164582Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.3164698Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.3164835Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3164912Z 2025-05-07T20:32:35.3165010Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.3165014Z 2025-05-07T20:32:35.3165116Z moe/activation_test.py:126: 2025-05-07T20:32:35.3175262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3175451Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.3175628Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3176216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.3176327Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.3176689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3176910Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3177281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.3177544Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3177924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.3178101Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.3178443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.3178524Z fn() 2025-05-07T20:32:35.3178924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.3179007Z self.fn.run( 2025-05-07T20:32:35.3179345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3179440Z kernel = self.compile( 2025-05-07T20:32:35.3179899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3180160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3180294Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3180300Z 2025-05-07T20:32:35.3180510Z self = 2025-05-07T20:32:35.3181288Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3181829Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1097e8e00>} 2025-05-07T20:32:35.3182566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3182764Z context = 2025-05-07T20:32:35.3182769Z 2025-05-07T20:32:35.3182940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3183200Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3183308Z module_map=module_map) 2025-05-07T20:32:35.3183479Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3183584Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.3183659Z E ^ 2025-05-07T20:32:35.3184014Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    fails in ref_fn() with the same CompilationError (_kernel_quantize_fp8_row: fp8e4nv not supported)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    fails in ref_fn() with the same CompilationError (_kernel_quantize_fp8_row: fp8e4nv not supported)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    fails in ref_fn() with the same CompilationError (_kernel_quantize_fp8_row: fp8e4nv not supported)
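Every example on this path dies inside triton_quantize_fp8_row at kernel-compile time, before any test assertion runs, so the failure reproduces without Hypothesis. A minimal repro sketch, assuming only the import path and call pattern visible in the traceback above:

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On an sm_86 GPU this raises triton.compiler.errors.CompilationError
    # ("type fp8e4nv not supported in this architecture").
    y_fp8, y_scale = triton_quantize_fp8_row(y, None)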
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102dccf40>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
        The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
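For reference, both failing paths compute the same thing: a SiLU-gated product followed by rowwise FP8 quantization. Below is a pure-PyTorch stand-in; the SiLU-mul part mirrors ref_fn in the test above, while the rowwise-scale semantics (scale = max(|row|)/FP8_MAX, optionally capped by scale_ub) are an assumption rather than taken from fp8_gemm.py:

    from typing import Optional, Tuple
    import torch

    FP8_DTYPE = torch.float8_e4m3fn        # assumed target dtype
    FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
        x0 = x0.to(torch.float32)
        y = x0 * torch.sigmoid(x0) * x1.to(torch.float32)
        # Rowwise quantization: one fp32 scale per row (assumed semantics).
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_q = (y / scale).clamp(-FP8_MAX, FP8_MAX)
        return y_q.to(FP8_DTYPE), scale.squeeze(-1)

Dequantizing as the test does (y_fp8.to(torch.float32) * y_scale[:, None]) recovers y up to FP8 rounding.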
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3270082Z 2025-05-07T20:32:35.3270495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3270573Z 2025-05-07T20:32:35.3270678Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3270900Z self=, 2025-05-07T20:32:35.3270984Z T=1, 2025-05-07T20:32:35.3271064Z D=5120, 2025-05-07T20:32:35.3271144Z scale_ub=None, 2025-05-07T20:32:35.3271236Z contiguous=False, 2025-05-07T20:32:35.3271316Z compiled=True, 2025-05-07T20:32:35.3271437Z ) 2025-05-07T20:32:35.3271652Z self = 2025-05-07T20:32:35.3271813Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3271818Z 2025-05-07T20:32:35.3271895Z @given( 2025-05-07T20:32:35.3272013Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3272111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3272230Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3272348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3272459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3272540Z ) 2025-05-07T20:32:35.3272782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3272878Z def test_silu_mul_quant( 2025-05-07T20:32:35.3272953Z self, 2025-05-07T20:32:35.3273028Z T: int, 2025-05-07T20:32:35.3273110Z D: int, 2025-05-07T20:32:35.3273211Z scale_ub: Optional[float], 2025-05-07T20:32:35.3273299Z contiguous: bool, 2025-05-07T20:32:35.3273388Z compiled: bool, 2025-05-07T20:32:35.3273466Z ) -> None: 2025-05-07T20:32:35.3273557Z torch.manual_seed(2025) 2025-05-07T20:32:35.3273631Z 2025-05-07T20:32:35.3273798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3273873Z 2025-05-07T20:32:35.3273969Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3274091Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3274191Z x = x_sign * x_clamp 2025-05-07T20:32:35.3274269Z x0 = x[:, :D] 2025-05-07T20:32:35.3274345Z x1 = x[:, D:] 2025-05-07T20:32:35.3274425Z 2025-05-07T20:32:35.3274509Z if contiguous: 2025-05-07T20:32:35.3274601Z x0 = x0.contiguous() 2025-05-07T20:32:35.3274693Z x1 = x1.contiguous() 2025-05-07T20:32:35.3274767Z 2025-05-07T20:32:35.3274856Z if scale_ub is not None: 2025-05-07T20:32:35.3274970Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3275100Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3275173Z ) 2025-05-07T20:32:35.3275255Z else: 2025-05-07T20:32:35.3275352Z scale_ub_tensor = None 2025-05-07T20:32:35.3275430Z 2025-05-07T20:32:35.3275558Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3275650Z op = silu_mul_quant 2025-05-07T20:32:35.3275739Z if compiled: 2025-05-07T20:32:35.3275842Z op = torch.compile(op) 2025-05-07T20:32:35.3275950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3276026Z 2025-05-07T20:32:35.3276116Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.3276235Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.3276313Z 2025-05-07T20:32:35.3276453Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3276558Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.3276662Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.3276783Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.3276925Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3277007Z 2025-05-07T20:32:35.3277107Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.3277160Z 2025-05-07T20:32:35.3277265Z moe/activation_test.py:126: 2025-05-07T20:32:35.3277466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3277572Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.3277712Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.3278262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.3278365Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.3278757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3278976Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3279340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.3279595Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.3279968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.3280143Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.3280480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.3280562Z fn() 2025-05-07T20:32:35.3280957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.3281041Z self.fn.run( 2025-05-07T20:32:35.3281380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3281475Z kernel = self.compile( 2025-05-07T20:32:35.3281855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3282031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3282162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3282167Z 2025-05-07T20:32:35.3282377Z self = 2025-05-07T20:32:35.3283135Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3283644Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102dcef20>} 2025-05-07T20:32:35.3284373Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3284563Z context = 2025-05-07T20:32:35.3284570Z 2025-05-07T20:32:35.3284739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3285000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3285110Z module_map=module_map) 2025-05-07T20:32:35.3285271Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3285375Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.3285462Z E ^ 2025-05-07T20:32:35.3285858Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3285862Z 2025-05-07T20:32:35.3286269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3286324Z 2025-05-07T20:32:35.3286427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3286723Z self=, 2025-05-07T20:32:35.3286805Z T=1, 2025-05-07T20:32:35.3286880Z D=5120, 2025-05-07T20:32:35.3286959Z scale_ub=None, 2025-05-07T20:32:35.3287049Z contiguous=True, 2025-05-07T20:32:35.3287133Z compiled=False, 2025-05-07T20:32:35.3287204Z ) 2025-05-07T20:32:35.3287423Z self = 2025-05-07T20:32:35.3287585Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3287629Z 2025-05-07T20:32:35.3287710Z @given( 2025-05-07T20:32:35.3287827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3287927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3288047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3288162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3288277Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3288358Z ) 2025-05-07T20:32:35.3288602Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3288693Z def test_silu_mul_quant( 2025-05-07T20:32:35.3288777Z self, 2025-05-07T20:32:35.3288852Z T: int, 2025-05-07T20:32:35.3288926Z D: int, 2025-05-07T20:32:35.3289030Z scale_ub: Optional[float], 2025-05-07T20:32:35.3289119Z contiguous: bool, 2025-05-07T20:32:35.3289211Z compiled: bool, 2025-05-07T20:32:35.3289291Z ) -> None: 2025-05-07T20:32:35.3289384Z torch.manual_seed(2025) 2025-05-07T20:32:35.3289461Z 2025-05-07T20:32:35.3289627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3289703Z 2025-05-07T20:32:35.3289801Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3289925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3290016Z x = x_sign * x_clamp 2025-05-07T20:32:35.3290102Z x0 = x[:, :D] 2025-05-07T20:32:35.3290181Z x1 = x[:, D:] 2025-05-07T20:32:35.3290259Z 2025-05-07T20:32:35.3290348Z if contiguous: 2025-05-07T20:32:35.3290437Z x0 = x0.contiguous() 2025-05-07T20:32:35.3290529Z x1 = x1.contiguous() 2025-05-07T20:32:35.3290604Z 2025-05-07T20:32:35.3290693Z if scale_ub is not None: 2025-05-07T20:32:35.3290799Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3290932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3291011Z ) 2025-05-07T20:32:35.3291091Z else: 2025-05-07T20:32:35.3291186Z scale_ub_tensor = None 2025-05-07T20:32:35.3291259Z 2025-05-07T20:32:35.3291395Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3291485Z op = silu_mul_quant 2025-05-07T20:32:35.3291568Z if compiled: 2025-05-07T20:32:35.3291676Z op = torch.compile(op) 2025-05-07T20:32:35.3291781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3291865Z 2025-05-07T20:32:35.3291960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3291965Z 2025-05-07T20:32:35.3292063Z moe/activation_test.py:117: 2025-05-07T20:32:35.3292196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3292295Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3292395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3292893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3293046Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3293408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3293628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3294036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3294210Z kernel = self.compile( 2025-05-07T20:32:35.3294588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3294761Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3294893Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3294897Z 2025-05-07T20:32:35.3295140Z self = 2025-05-07T20:32:35.3295953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3296448Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb102dcfa60>} 2025-05-07T20:32:35.3297190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3297383Z context = 2025-05-07T20:32:35.3297387Z 2025-05-07T20:32:35.3297551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3297815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3297921Z module_map=module_map) 2025-05-07T20:32:35.3298081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3298186Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3298264Z E ^ 2025-05-07T20:32:35.3298618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3298626Z 2025-05-07T20:32:35.3299032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3299036Z 2025-05-07T20:32:35.3299137Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3299361Z self=, 2025-05-07T20:32:35.3299439Z T=128, 2025-05-07T20:32:35.3299523Z D=5120, 2025-05-07T20:32:35.3299606Z scale_ub=None, 2025-05-07T20:32:35.3299694Z contiguous=False, 2025-05-07T20:32:35.3299783Z compiled=True, 2025-05-07T20:32:35.3299856Z ) 2025-05-07T20:32:35.3300070Z self = 2025-05-07T20:32:35.3300244Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3300251Z 2025-05-07T20:32:35.3300329Z @given( 2025-05-07T20:32:35.3300449Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3300557Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3300670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3300792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3300903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3300979Z ) 2025-05-07T20:32:35.3301225Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3301321Z def test_silu_mul_quant( 2025-05-07T20:32:35.3301397Z self, 2025-05-07T20:32:35.3301479Z T: int, 2025-05-07T20:32:35.3301556Z D: int, 2025-05-07T20:32:35.3301652Z scale_ub: Optional[float], 2025-05-07T20:32:35.3301747Z contiguous: bool, 2025-05-07T20:32:35.3301832Z compiled: bool, 2025-05-07T20:32:35.3301907Z ) -> None: 2025-05-07T20:32:35.3302054Z torch.manual_seed(2025) 2025-05-07T20:32:35.3302125Z 2025-05-07T20:32:35.3302292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3302445Z 2025-05-07T20:32:35.3302539Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3302667Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3302754Z x = x_sign * x_clamp 2025-05-07T20:32:35.3302831Z x0 = x[:, :D] 2025-05-07T20:32:35.3302917Z x1 = x[:, D:] 2025-05-07T20:32:35.3302989Z 2025-05-07T20:32:35.3303072Z if contiguous: 2025-05-07T20:32:35.3303209Z x0 = x0.contiguous() 2025-05-07T20:32:35.3303298Z x1 = x1.contiguous() 2025-05-07T20:32:35.3303372Z 2025-05-07T20:32:35.3303469Z if scale_ub is not None: 2025-05-07T20:32:35.3303574Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3303706Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3303790Z ) 2025-05-07T20:32:35.3303869Z else: 2025-05-07T20:32:35.3303970Z scale_ub_tensor = None 2025-05-07T20:32:35.3304043Z 2025-05-07T20:32:35.3304176Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3304268Z op = silu_mul_quant 2025-05-07T20:32:35.3304351Z if compiled: 2025-05-07T20:32:35.3304449Z op = torch.compile(op) 2025-05-07T20:32:35.3304560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3304634Z 2025-05-07T20:32:35.3304723Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3304731Z 2025-05-07T20:32:35.3304833Z moe/activation_test.py:117: 2025-05-07T20:32:35.3304960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3305062Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3305162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3305523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3305621Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3306110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3306207Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3306563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3306782Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3307121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3307216Z kernel = self.compile( 2025-05-07T20:32:35.3307590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3307768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3307895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3307899Z 2025-05-07T20:32:35.3308109Z self = 2025-05-07T20:32:35.3308873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3309369Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201d1c0>} 2025-05-07T20:32:35.3310108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3310295Z context = 2025-05-07T20:32:35.3310346Z 2025-05-07T20:32:35.3310517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3310845Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3310953Z module_map=module_map) 2025-05-07T20:32:35.3311119Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3311220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3311298Z E ^ 2025-05-07T20:32:35.3311653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3311695Z 2025-05-07T20:32:35.3312102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3312106Z 2025-05-07T20:32:35.3312213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3312431Z self=, 2025-05-07T20:32:35.3312512Z T=128, 2025-05-07T20:32:35.3312596Z D=7168, 2025-05-07T20:32:35.3312685Z scale_ub=1200.0, 2025-05-07T20:32:35.3312770Z contiguous=False, 2025-05-07T20:32:35.3312861Z compiled=False, 2025-05-07T20:32:35.3312935Z ) 2025-05-07T20:32:35.3313156Z self = 2025-05-07T20:32:35.3313326Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3313330Z 2025-05-07T20:32:35.3313404Z @given( 2025-05-07T20:32:35.3313536Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3313636Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3313750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3313874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3313991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3314073Z ) 2025-05-07T20:32:35.3314317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3314415Z def test_silu_mul_quant( 2025-05-07T20:32:35.3314500Z self, 2025-05-07T20:32:35.3314579Z T: int, 2025-05-07T20:32:35.3314657Z D: int, 2025-05-07T20:32:35.3314760Z scale_ub: Optional[float], 2025-05-07T20:32:35.3314849Z contiguous: bool, 2025-05-07T20:32:35.3314934Z compiled: bool, 2025-05-07T20:32:35.3315015Z ) -> None: 2025-05-07T20:32:35.3315108Z torch.manual_seed(2025) 2025-05-07T20:32:35.3315184Z 2025-05-07T20:32:35.3315358Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3315431Z 2025-05-07T20:32:35.3315521Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3315666Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3315766Z x = x_sign * x_clamp 2025-05-07T20:32:35.3315863Z x0 = x[:, :D] 2025-05-07T20:32:35.3315958Z x1 = x[:, D:] 2025-05-07T20:32:35.3316029Z 2025-05-07T20:32:35.3316116Z if contiguous: 2025-05-07T20:32:35.3316212Z x0 = x0.contiguous() 2025-05-07T20:32:35.3316300Z x1 = x1.contiguous() 2025-05-07T20:32:35.3316376Z 2025-05-07T20:32:35.3316466Z if scale_ub is not None: 2025-05-07T20:32:35.3316569Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3316709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3316783Z ) 2025-05-07T20:32:35.3316860Z else: 2025-05-07T20:32:35.3316961Z scale_ub_tensor = None 2025-05-07T20:32:35.3317034Z 2025-05-07T20:32:35.3317168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3317256Z op = silu_mul_quant 2025-05-07T20:32:35.3317339Z if compiled: 2025-05-07T20:32:35.3317445Z op = torch.compile(op) 2025-05-07T20:32:35.3317548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3317670Z 2025-05-07T20:32:35.3317768Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3317773Z 2025-05-07T20:32:35.3317944Z moe/activation_test.py:117: 2025-05-07T20:32:35.3318074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3318177Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3318272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3318770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3318929Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3319279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3319504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3319836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3319931Z kernel = self.compile( 2025-05-07T20:32:35.3320316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3320489Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3320619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3320623Z 2025-05-07T20:32:35.3320825Z self = 2025-05-07T20:32:35.3321583Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3322086Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10201cd60>} 2025-05-07T20:32:35.3322823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3323014Z context = 2025-05-07T20:32:35.3323019Z 2025-05-07T20:32:35.3323181Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3323446Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3323554Z module_map=module_map) 2025-05-07T20:32:35.3323715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3323819Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3323897Z E ^ 2025-05-07T20:32:35.3328819Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3328836Z 2025-05-07T20:32:35.3329439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3329445Z 2025-05-07T20:32:35.3329556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3329780Z self=, 2025-05-07T20:32:35.3329857Z T=128, 2025-05-07T20:32:35.3329940Z D=5120, 2025-05-07T20:32:35.3330021Z scale_ub=None, 2025-05-07T20:32:35.3330111Z contiguous=False, 2025-05-07T20:32:35.3330197Z compiled=False, 2025-05-07T20:32:35.3330268Z ) 2025-05-07T20:32:35.3330492Z self = 2025-05-07T20:32:35.3330661Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.3330666Z 2025-05-07T20:32:35.3330742Z @given( 2025-05-07T20:32:35.3330866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3331049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3331163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3331359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3331473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3331548Z ) 2025-05-07T20:32:35.3331799Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3331893Z def test_silu_mul_quant( 2025-05-07T20:32:35.3331977Z self, 2025-05-07T20:32:35.3332055Z T: int, 2025-05-07T20:32:35.3332206Z D: int, 2025-05-07T20:32:35.3332308Z scale_ub: Optional[float], 2025-05-07T20:32:35.3332398Z contiguous: bool, 2025-05-07T20:32:35.3332492Z compiled: bool, 2025-05-07T20:32:35.3332574Z ) -> None: 2025-05-07T20:32:35.3332667Z torch.manual_seed(2025) 2025-05-07T20:32:35.3332736Z 2025-05-07T20:32:35.3332906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3332983Z 2025-05-07T20:32:35.3333151Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3333284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3333369Z x = x_sign * x_clamp 2025-05-07T20:32:35.3333452Z x0 = x[:, :D] 2025-05-07T20:32:35.3333530Z x1 = x[:, D:] 2025-05-07T20:32:35.3333602Z 2025-05-07T20:32:35.3333689Z if contiguous: 2025-05-07T20:32:35.3333780Z x0 = x0.contiguous() 2025-05-07T20:32:35.3333866Z x1 = x1.contiguous() 2025-05-07T20:32:35.3333944Z 2025-05-07T20:32:35.3334036Z if scale_ub is not None: 2025-05-07T20:32:35.3334140Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3334275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3334350Z ) 2025-05-07T20:32:35.3334423Z else: 2025-05-07T20:32:35.3334520Z scale_ub_tensor = None 2025-05-07T20:32:35.3334595Z 2025-05-07T20:32:35.3334726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3334814Z op = silu_mul_quant 2025-05-07T20:32:35.3334901Z if compiled: 2025-05-07T20:32:35.3335008Z op = torch.compile(op) 2025-05-07T20:32:35.3335110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3335182Z 2025-05-07T20:32:35.3335275Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3335279Z 2025-05-07T20:32:35.3335373Z moe/activation_test.py:117: 2025-05-07T20:32:35.3335498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3335604Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3335701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3336195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3336289Z 
    _fbgemm_silu_mul_quant[grid](
    ... (same Triton compile path as in the full traceback below) ...
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
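Every failing example bottoms out in the same place: Triton's NVIDIA backend refuses to lower the fp8e4nv type (PyTorch's torch.float8_e4m3fn) on this GPU. This job ran on a g5.4xlarge runner, whose A10G GPU reports compute capability (8, 6); Triton only lowers fp8e4nv on compute capability 8.9 and newer (Ada/Hopper), which is why the error offers only 'fp8e4b15' and 'fp8e5' as alternatives. A minimal capability probe, as a sketch (the helper name is illustrative, not something from the test file):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) lowers in Triton's NVIDIA backend only on
        # SM 8.9+ (Ada/Hopper); the A10G in a g5.4xlarge reports (8, 6) and fails.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

A guard along these lines at the test or class level (for example unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")) would turn these hard failures into skips on pre-Ada runners.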
The next examples repeat the identical source listing and traceback; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
-> same CompilationError from _fbgemm_silu_mul_quant; the compiled=True path enters through
   torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching activation.py:80

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
-> same CompilationError from _fbgemm_silu_mul_quant
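The compiled=True draws add only the torch/_dynamo/eval_frame.py:678 frame before hitting the identical failure, because torch.compile is lazy: wrapping silu_mul_quant succeeds, and the Triton build error only surfaces on the first call. A small sketch of that behavior, with a stand-in op:

    import torch

    def op(x: torch.Tensor) -> torch.Tensor:
        # Stand-in for silu_mul_quant: any function that lowers to a GPU kernel.
        return x * torch.sigmoid(x)

    compiled_op = torch.compile(op)                  # returns immediately; nothing is compiled yet
    y = compiled_op(torch.randn(8, device="cuda"))   # first call triggers compilation, so backend
                                                     # errors surface here, inside eval_frame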
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
-> this draw gets past fn() and fails in the test's reference path instead (same @given listing as above):

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
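The reference path makes the intended semantics explicit: y = silu(x0) * x1 computed in fp32, then row-wise FP8 quantization returning (y_fp8, y_scale) such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that contract follows; the function name, the E4M3 max of 448.0, and the 1e-12 epsilon are assumptions for illustration, and triton_quantize_fp8_row's exact clamping and scale_ub handling may differ:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = silu(x0) * x1, in fp32, like ref_fn in the test.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One fp32 scale per row, chosen so y / scale fits the E4M3 range.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        fp8_max = 448.0  # torch.finfo(torch.float8_e4m3fn).max
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale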
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3395998Z 2025-05-07T20:32:35.3396405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3396410Z 2025-05-07T20:32:35.3396510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3396730Z self=, 2025-05-07T20:32:35.3396808Z T=1, 2025-05-07T20:32:35.3396884Z D=5120, 2025-05-07T20:32:35.3396968Z scale_ub=1200.0, 2025-05-07T20:32:35.3397052Z contiguous=False, 2025-05-07T20:32:35.3397138Z compiled=True, 2025-05-07T20:32:35.3397212Z ) 2025-05-07T20:32:35.3397423Z self = 2025-05-07T20:32:35.3397584Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3397589Z 2025-05-07T20:32:35.3397670Z @given( 2025-05-07T20:32:35.3397785Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3397891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3398004Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3398120Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3398234Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3398306Z ) 2025-05-07T20:32:35.3398548Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3398643Z def test_silu_mul_quant( 2025-05-07T20:32:35.3398714Z self, 2025-05-07T20:32:35.3398788Z T: int, 2025-05-07T20:32:35.3398868Z D: int, 2025-05-07T20:32:35.3398965Z scale_ub: Optional[float], 2025-05-07T20:32:35.3399056Z contiguous: bool, 2025-05-07T20:32:35.3399140Z compiled: bool, 2025-05-07T20:32:35.3399214Z ) -> None: 2025-05-07T20:32:35.3399307Z torch.manual_seed(2025) 2025-05-07T20:32:35.3399378Z 2025-05-07T20:32:35.3399543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3399623Z 2025-05-07T20:32:35.3399713Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3399836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3399927Z x = x_sign * x_clamp 2025-05-07T20:32:35.3400007Z x0 = x[:, :D] 2025-05-07T20:32:35.3400084Z x1 = x[:, D:] 2025-05-07T20:32:35.3400161Z 2025-05-07T20:32:35.3400246Z if contiguous: 2025-05-07T20:32:35.3400332Z x0 = x0.contiguous() 2025-05-07T20:32:35.3400423Z x1 = x1.contiguous() 2025-05-07T20:32:35.3400496Z 2025-05-07T20:32:35.3400585Z if scale_ub is not None: 2025-05-07T20:32:35.3400686Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3400817Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3400901Z ) 2025-05-07T20:32:35.3400975Z else: 2025-05-07T20:32:35.3401068Z scale_ub_tensor = None 2025-05-07T20:32:35.3401146Z 2025-05-07T20:32:35.3401275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3401361Z op = silu_mul_quant 2025-05-07T20:32:35.3401448Z if compiled: 2025-05-07T20:32:35.3401544Z op = torch.compile(op) 2025-05-07T20:32:35.3401645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3401720Z 2025-05-07T20:32:35.3401806Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3401858Z 2025-05-07T20:32:35.3401957Z moe/activation_test.py:117: 2025-05-07T20:32:35.3402154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3402253Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3402355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3402713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3402803Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3403288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3403424Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3403776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3403995Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3404328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3404426Z kernel = self.compile( 2025-05-07T20:32:35.3404797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3404966Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3405095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3405100Z 2025-05-07T20:32:35.3405304Z self = 2025-05-07T20:32:35.3406066Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3406556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c68540>} 2025-05-07T20:32:35.3407292Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3407476Z context = 2025-05-07T20:32:35.3407481Z 2025-05-07T20:32:35.3407643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3407902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3408004Z module_map=module_map) 2025-05-07T20:32:35.3408165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3408262Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3408334Z E ^ 2025-05-07T20:32:35.3408687Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3408691Z 2025-05-07T20:32:35.3409097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3409102Z 2025-05-07T20:32:35.3409201Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3409421Z self=, 2025-05-07T20:32:35.3409497Z T=1, 2025-05-07T20:32:35.3409574Z D=5120, 2025-05-07T20:32:35.3409658Z scale_ub=1200.0, 2025-05-07T20:32:35.3409741Z contiguous=False, 2025-05-07T20:32:35.3409825Z compiled=False, 2025-05-07T20:32:35.3409896Z ) 2025-05-07T20:32:35.3410108Z self = 2025-05-07T20:32:35.3410275Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3410280Z 2025-05-07T20:32:35.3410400Z @given( 2025-05-07T20:32:35.3410515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3410688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3410802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3410919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3411029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3411102Z ) 2025-05-07T20:32:35.3411345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3411436Z def test_silu_mul_quant( 2025-05-07T20:32:35.3411572Z self, 2025-05-07T20:32:35.3411651Z T: int, 2025-05-07T20:32:35.3411723Z D: int, 2025-05-07T20:32:35.3411820Z scale_ub: Optional[float], 2025-05-07T20:32:35.3411910Z contiguous: bool, 2025-05-07T20:32:35.3411993Z compiled: bool, 2025-05-07T20:32:35.3412065Z ) -> None: 2025-05-07T20:32:35.3412160Z torch.manual_seed(2025) 2025-05-07T20:32:35.3412235Z 2025-05-07T20:32:35.3412403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3412481Z 2025-05-07T20:32:35.3412569Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3412693Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3412779Z x = x_sign * x_clamp 2025-05-07T20:32:35.3412853Z x0 = x[:, :D] 2025-05-07T20:32:35.3412934Z x1 = x[:, D:] 2025-05-07T20:32:35.3413069Z 2025-05-07T20:32:35.3413150Z if contiguous: 2025-05-07T20:32:35.3413241Z x0 = x0.contiguous() 2025-05-07T20:32:35.3413332Z x1 = x1.contiguous() 2025-05-07T20:32:35.3413403Z 2025-05-07T20:32:35.3413494Z if scale_ub is not None: 2025-05-07T20:32:35.3413597Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3413730Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3413802Z ) 2025-05-07T20:32:35.3413875Z else: 2025-05-07T20:32:35.3413973Z scale_ub_tensor = None 2025-05-07T20:32:35.3414043Z 2025-05-07T20:32:35.3414173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3414262Z op = silu_mul_quant 2025-05-07T20:32:35.3414345Z if compiled: 2025-05-07T20:32:35.3414443Z op = torch.compile(op) 2025-05-07T20:32:35.3414549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3414618Z 2025-05-07T20:32:35.3414705Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3414712Z 2025-05-07T20:32:35.3414809Z moe/activation_test.py:117: 2025-05-07T20:32:35.3414934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3415034Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3415131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3415621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3415724Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3416079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3416295Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3416629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3416718Z kernel = self.compile( 2025-05-07T20:32:35.3417094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3417266Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3417387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3417392Z 2025-05-07T20:32:35.3417594Z self = 2025-05-07T20:32:35.3418469Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3418967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017dbb560>} 2025-05-07T20:32:35.3419695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3419922Z context = 2025-05-07T20:32:35.3419927Z 2025-05-07T20:32:35.3420087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3420342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3420450Z module_map=module_map) 2025-05-07T20:32:35.3420610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3420705Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3420782Z E ^ 2025-05-07T20:32:35.3421126Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3421131Z 2025-05-07T20:32:35.3421536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3421543Z 2025-05-07T20:32:35.3421641Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3421857Z self=, 2025-05-07T20:32:35.3421935Z T=16384, 2025-05-07T20:32:35.3422007Z D=5120, 2025-05-07T20:32:35.3422088Z scale_ub=1200.0, 2025-05-07T20:32:35.3422179Z contiguous=False, 2025-05-07T20:32:35.3422259Z compiled=True, 2025-05-07T20:32:35.3422334Z ) 2025-05-07T20:32:35.3422555Z self = 2025-05-07T20:32:35.3422729Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3422733Z 2025-05-07T20:32:35.3422810Z @given( 2025-05-07T20:32:35.3422926Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3423021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3423134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3423250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3423359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3423435Z ) 2025-05-07T20:32:35.3423674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3423766Z def test_silu_mul_quant( 2025-05-07T20:32:35.3423838Z self, 2025-05-07T20:32:35.3423913Z T: int, 2025-05-07T20:32:35.3423991Z D: int, 2025-05-07T20:32:35.3424087Z scale_ub: Optional[float], 2025-05-07T20:32:35.3424179Z contiguous: bool, 2025-05-07T20:32:35.3424264Z compiled: bool, 2025-05-07T20:32:35.3424339Z ) -> None: 2025-05-07T20:32:35.3424431Z torch.manual_seed(2025) 2025-05-07T20:32:35.3424501Z 2025-05-07T20:32:35.3424664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3424738Z 2025-05-07T20:32:35.3424828Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3424955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3425042Z x = x_sign * x_clamp 2025-05-07T20:32:35.3425118Z x0 = x[:, :D] 2025-05-07T20:32:35.3425194Z x1 = x[:, D:] 2025-05-07T20:32:35.3425270Z 2025-05-07T20:32:35.3425349Z if contiguous: 2025-05-07T20:32:35.3425435Z x0 = x0.contiguous() 2025-05-07T20:32:35.3425524Z x1 = x1.contiguous() 2025-05-07T20:32:35.3425640Z 2025-05-07T20:32:35.3425727Z if scale_ub is not None: 2025-05-07T20:32:35.3425904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3426035Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3426111Z ) 2025-05-07T20:32:35.3426188Z else: 2025-05-07T20:32:35.3426279Z scale_ub_tensor = None 2025-05-07T20:32:35.3426346Z 2025-05-07T20:32:35.3426475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3426562Z op = silu_mul_quant 2025-05-07T20:32:35.3426692Z if compiled: 2025-05-07T20:32:35.3426788Z op = torch.compile(op) 2025-05-07T20:32:35.3426889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3426958Z 2025-05-07T20:32:35.3427045Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3427050Z 2025-05-07T20:32:35.3427142Z moe/activation_test.py:117: 2025-05-07T20:32:35.3427272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3427369Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3427469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3427832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3427920Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3428403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3428501Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3428849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3429068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3429400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3429503Z kernel = self.compile( 2025-05-07T20:32:35.3429882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3430054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3430186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3430191Z 2025-05-07T20:32:35.3430394Z self = 2025-05-07T20:32:35.3431159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3431657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1024c85e0>} 2025-05-07T20:32:35.3432392Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3432587Z context = 2025-05-07T20:32:35.3432591Z 2025-05-07T20:32:35.3432754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3433016Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3433126Z module_map=module_map) 2025-05-07T20:32:35.3433287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3433391Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3433469Z E ^ 2025-05-07T20:32:35.3433817Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3433877Z 2025-05-07T20:32:35.3434897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3434904Z 2025-05-07T20:32:35.3435008Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3435236Z self=, 2025-05-07T20:32:35.3435316Z T=2048, 2025-05-07T20:32:35.3435393Z D=7168, 2025-05-07T20:32:35.3435482Z scale_ub=1200.0, 2025-05-07T20:32:35.3435568Z contiguous=False, 2025-05-07T20:32:35.3435650Z compiled=True, 2025-05-07T20:32:35.3435769Z ) 2025-05-07T20:32:35.3435986Z self = 2025-05-07T20:32:35.3436165Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3436170Z 2025-05-07T20:32:35.3436245Z @given( 2025-05-07T20:32:35.3436363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3436472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3436585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3436706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3436824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3436900Z ) 2025-05-07T20:32:35.3437145Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3437250Z def test_silu_mul_quant( 2025-05-07T20:32:35.3437326Z self, 2025-05-07T20:32:35.3437409Z T: int, 2025-05-07T20:32:35.3437486Z D: int, 2025-05-07T20:32:35.3437583Z scale_ub: Optional[float], 2025-05-07T20:32:35.3437680Z contiguous: bool, 2025-05-07T20:32:35.3437765Z compiled: bool, 2025-05-07T20:32:35.3437842Z ) -> None: 2025-05-07T20:32:35.3437940Z torch.manual_seed(2025) 2025-05-07T20:32:35.3438014Z 2025-05-07T20:32:35.3438179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3438267Z 2025-05-07T20:32:35.3438357Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3438485Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3438577Z x = x_sign * x_clamp 2025-05-07T20:32:35.3438656Z x0 = x[:, :D] 2025-05-07T20:32:35.3438740Z x1 = x[:, D:] 2025-05-07T20:32:35.3438811Z 2025-05-07T20:32:35.3438891Z if contiguous: 2025-05-07T20:32:35.3438988Z x0 = x0.contiguous() 2025-05-07T20:32:35.3439077Z x1 = x1.contiguous() 2025-05-07T20:32:35.3439151Z 2025-05-07T20:32:35.3439251Z if scale_ub is not None: 2025-05-07T20:32:35.3439357Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3439489Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3439571Z ) 2025-05-07T20:32:35.3439648Z else: 2025-05-07T20:32:35.3439740Z scale_ub_tensor = None 2025-05-07T20:32:35.3439821Z 2025-05-07T20:32:35.3439951Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3440039Z op = silu_mul_quant 2025-05-07T20:32:35.3440134Z if compiled: 2025-05-07T20:32:35.3440232Z op = torch.compile(op) 2025-05-07T20:32:35.3440343Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3440415Z 2025-05-07T20:32:35.3440505Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3440510Z 2025-05-07T20:32:35.3440611Z moe/activation_test.py:117: 2025-05-07T20:32:35.3440737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3440840Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3440944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3441306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3441405Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3441890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3442033Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3442487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3442709Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3443042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3443140Z kernel = self.compile( 2025-05-07T20:32:35.3443553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3443728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3443854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3443859Z 2025-05-07T20:32:35.3444062Z self = 2025-05-07T20:32:35.3444836Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3445332Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10313a8e0>} 2025-05-07T20:32:35.3446116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3446307Z context = 2025-05-07T20:32:35.3446311Z 2025-05-07T20:32:35.3446478Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3446738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3446848Z module_map=module_map) 2025-05-07T20:32:35.3447014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3450870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3450961Z E ^ 2025-05-07T20:32:35.3451324Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3451329Z 2025-05-07T20:32:35.3451744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3451749Z 2025-05-07T20:32:35.3451854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3452074Z self=, 2025-05-07T20:32:35.3452151Z T=1, 2025-05-07T20:32:35.3452229Z D=5120, 2025-05-07T20:32:35.3452312Z scale_ub=None, 2025-05-07T20:32:35.3452397Z contiguous=False, 2025-05-07T20:32:35.3452482Z compiled=False, 2025-05-07T20:32:35.3452552Z ) 2025-05-07T20:32:35.3452771Z self = 2025-05-07T20:32:35.3452937Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.3452942Z 2025-05-07T20:32:35.3453073Z @given( 2025-05-07T20:32:35.3453193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3453291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3453406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3453524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3453636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3453709Z ) 2025-05-07T20:32:35.3453953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3454043Z def test_silu_mul_quant( 2025-05-07T20:32:35.3454187Z self, 2025-05-07T20:32:35.3454264Z T: int, 2025-05-07T20:32:35.3454337Z D: int, 2025-05-07T20:32:35.3454506Z scale_ub: Optional[float], 2025-05-07T20:32:35.3454601Z contiguous: bool, 2025-05-07T20:32:35.3454684Z compiled: bool, 2025-05-07T20:32:35.3454764Z ) -> None: 2025-05-07T20:32:35.3454856Z torch.manual_seed(2025) 2025-05-07T20:32:35.3454926Z 2025-05-07T20:32:35.3455093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3455163Z 2025-05-07T20:32:35.3455291Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3455417Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3455503Z x = x_sign * x_clamp 2025-05-07T20:32:35.3455581Z x0 = x[:, :D] 2025-05-07T20:32:35.3455661Z x1 = x[:, D:] 2025-05-07T20:32:35.3455729Z 2025-05-07T20:32:35.3455811Z if contiguous: 2025-05-07T20:32:35.3455902Z x0 = x0.contiguous() 2025-05-07T20:32:35.3455991Z x1 = x1.contiguous() 2025-05-07T20:32:35.3456062Z 2025-05-07T20:32:35.3456155Z if scale_ub is not None: 2025-05-07T20:32:35.3456257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3456391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3456463Z ) 2025-05-07T20:32:35.3456538Z else: 2025-05-07T20:32:35.3456633Z scale_ub_tensor = None 2025-05-07T20:32:35.3456704Z 2025-05-07T20:32:35.3456830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3456924Z op = silu_mul_quant 2025-05-07T20:32:35.3457005Z if compiled: 2025-05-07T20:32:35.3457102Z op = torch.compile(op) 2025-05-07T20:32:35.3457208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3457277Z 2025-05-07T20:32:35.3457369Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3457373Z 2025-05-07T20:32:35.3457471Z moe/activation_test.py:117: 2025-05-07T20:32:35.3457595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3457702Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3457798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3458292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3458390Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3458739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3458962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3459524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3459646Z kernel = self.compile( 2025-05-07T20:32:35.3460025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3460202Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3460325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3460333Z 2025-05-07T20:32:35.3460534Z self = 2025-05-07T20:32:35.3461291Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3461790Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10333c7c0>} 2025-05-07T20:32:35.3462518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3462925Z context = 2025-05-07T20:32:35.3462931Z 2025-05-07T20:32:35.3463093Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3463346Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3463452Z module_map=module_map) 2025-05-07T20:32:35.3463610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3463766Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3463844Z E ^ 2025-05-07T20:32:35.3464192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3464196Z 2025-05-07T20:32:35.3464601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3464608Z 2025-05-07T20:32:35.3464707Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3464929Z self=, 2025-05-07T20:32:35.3465009Z T=4096, 2025-05-07T20:32:35.3465083Z D=7168, 2025-05-07T20:32:35.3465164Z scale_ub=1200.0, 2025-05-07T20:32:35.3465246Z contiguous=False, 2025-05-07T20:32:35.3465324Z compiled=False, 2025-05-07T20:32:35.3465400Z ) 2025-05-07T20:32:35.3465611Z self = 2025-05-07T20:32:35.3465786Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3465791Z 2025-05-07T20:32:35.3465871Z @given( 2025-05-07T20:32:35.3465987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3466085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3466199Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3466316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3466431Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3466504Z ) 2025-05-07T20:32:35.3466744Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3466836Z def test_silu_mul_quant( 2025-05-07T20:32:35.3466906Z self, 2025-05-07T20:32:35.3466980Z T: int, 2025-05-07T20:32:35.3467055Z D: int, 2025-05-07T20:32:35.3467152Z scale_ub: Optional[float], 2025-05-07T20:32:35.3467236Z contiguous: bool, 2025-05-07T20:32:35.3467326Z compiled: bool, 2025-05-07T20:32:35.3467403Z ) -> None: 2025-05-07T20:32:35.3467494Z torch.manual_seed(2025) 2025-05-07T20:32:35.3467569Z 2025-05-07T20:32:35.3467734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3467807Z 2025-05-07T20:32:35.3467897Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3468021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3468109Z x = x_sign * x_clamp 2025-05-07T20:32:35.3468190Z x0 = x[:, :D] 2025-05-07T20:32:35.3468267Z x1 = x[:, D:] 2025-05-07T20:32:35.3468338Z 2025-05-07T20:32:35.3468422Z if contiguous: 2025-05-07T20:32:35.3468509Z x0 = x0.contiguous() 2025-05-07T20:32:35.3468602Z x1 = x1.contiguous() 2025-05-07T20:32:35.3468674Z 2025-05-07T20:32:35.3468764Z if scale_ub is not None: 2025-05-07T20:32:35.3468868Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3469000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3469070Z ) 2025-05-07T20:32:35.3469147Z else: 2025-05-07T20:32:35.3469238Z scale_ub_tensor = None 2025-05-07T20:32:35.3469311Z 2025-05-07T20:32:35.3469436Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3469523Z op = silu_mul_quant 2025-05-07T20:32:35.3469656Z if compiled: 2025-05-07T20:32:35.3469752Z op = torch.compile(op) 2025-05-07T20:32:35.3469922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3470001Z 2025-05-07T20:32:35.3470088Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3470093Z 2025-05-07T20:32:35.3470191Z moe/activation_test.py:117: 2025-05-07T20:32:35.3470318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3470417Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3470517Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3471046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.3471142Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.3471494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.3471713Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.3472047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.3472142Z     kernel = self.compile(
2025-05-07T20:32:35.3472513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.3472687Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.3472809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.3472816Z 
2025-05-07T20:32:35.3473017Z self = 
2025-05-07T20:32:35.3473782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.3474278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb1085b5620>}
2025-05-07T20:32:35.3475008Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:35.3475195Z context = 
2025-05-07T20:32:35.3475199Z 
2025-05-07T20:32:35.3475362Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.3475616Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.3475718Z                            module_map=module_map)
2025-05-07T20:32:35.3475879Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.3475976Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.3476052Z E   ^
2025-05-07T20:32:35.3476405Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.3476410Z 
2025-05-07T20:32:35.3476812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.3476817Z 
2025-05-07T20:32:35.3476918Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.3477135Z     self=,
2025-05-07T20:32:35.3477213Z     T=16384,
2025-05-07T20:32:35.3477290Z     D=7168,
2025-05-07T20:32:35.3477368Z     scale_ub=None,
2025-05-07T20:32:35.3477449Z     contiguous=True,
2025-05-07T20:32:35.3477531Z     compiled=True,
2025-05-07T20:32:35.3477599Z )
2025-05-07T20:32:35.3477811Z self = 
2025-05-07T20:32:35.3477979Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:35.3478028Z 
2025-05-07T20:32:35.3478104Z     @given(
2025-05-07T20:32:35.3478317Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.3478415Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.3478526Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.3478641Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.3478751Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.3478823Z     )
2025-05-07T20:32:35.3479066Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.3479197Z     def test_silu_mul_quant(
2025-05-07T20:32:35.3479275Z         self,
2025-05-07T20:32:35.3479351Z         T: int,
2025-05-07T20:32:35.3479427Z         D: int,
2025-05-07T20:32:35.3479527Z         scale_ub: Optional[float],
2025-05-07T20:32:35.3479613Z         contiguous: bool,
2025-05-07T20:32:35.3479697Z         compiled: bool,
2025-05-07T20:32:35.3479779Z     ) -> None:
2025-05-07T20:32:35.3479873Z         torch.manual_seed(2025)
2025-05-07T20:32:35.3479944Z 
2025-05-07T20:32:35.3480115Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.3480188Z 
2025-05-07T20:32:35.3480279Z         x_sign = torch.sign(x)
2025-05-07T20:32:35.3480405Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.3480490Z         x = x_sign * x_clamp
2025-05-07T20:32:35.3480566Z         x0 = x[:, :D]
2025-05-07T20:32:35.3480647Z         x1 = x[:, D:]
2025-05-07T20:32:35.3480718Z 
2025-05-07T20:32:35.3480802Z         if contiguous:
2025-05-07T20:32:35.3480892Z             x0 = x0.contiguous()
2025-05-07T20:32:35.3480978Z             x1 = x1.contiguous()
2025-05-07T20:32:35.3481049Z 
2025-05-07T20:32:35.3481136Z         if scale_ub is not None:
2025-05-07T20:32:35.3481237Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:35.3481369Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:35.3481444Z             )
2025-05-07T20:32:35.3481520Z         else:
2025-05-07T20:32:35.3481615Z             scale_ub_tensor = None
2025-05-07T20:32:35.3481693Z 
2025-05-07T20:32:35.3481818Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.3481907Z             op = silu_mul_quant
2025-05-07T20:32:35.3481992Z             if compiled:
2025-05-07T20:32:35.3482091Z                 op = torch.compile(op)
2025-05-07T20:32:35.3482195Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.3482265Z 
2025-05-07T20:32:35.3482363Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:35.3482367Z 
2025-05-07T20:32:35.3482462Z moe/activation_test.py:117: 
2025-05-07T20:32:35.3482592Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.3482695Z moe/activation_test.py:115: in fn
2025-05-07T20:32:35.3482790Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.3483149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:35.3483244Z     return fn(*args, **kwargs)
2025-05-07T20:32:35.3483731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.3483828Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.3484177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.3484395Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.3484730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.3484821Z     kernel = self.compile(
2025-05-07T20:32:35.3485192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.3485367Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.3485540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.3485616Z 
2025-05-07T20:32:35.3485853Z self = 
2025-05-07T20:32:35.3486625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.3487119Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb10852c900>}
2025-05-07T20:32:35.3487884Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:35.3488074Z context = 
2025-05-07T20:32:35.3488079Z 
2025-05-07T20:32:35.3488248Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.3488503Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:35.3488609Z                            module_map=module_map)
2025-05-07T20:32:35.3488763Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.3488859Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.3488939Z E   ^
2025-05-07T20:32:35.3489288Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.3489293Z 
2025-05-07T20:32:35.3489694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:35.3489702Z 
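Every example fails before the kernel ever runs: Triton rejects the fp8e4nv (FP8 E4M3) dtype while lowering _fbgemm_silu_mul_quant. The linux.g5 runner's NVIDIA A10G reports compute capability (8, 6), and Triton lowers fp8e4nv only on SM 8.9+ parts (Ada/Hopper); on older architectures it offers just fp8e4b15 and fp8e5, exactly as the ValueError lists. A minimal gating sketch, assuming a unittest-style test class and that a device-capability check is an acceptable skip condition; the helper name and the (8, 9) threshold are illustrative, not FBGEMM's actual API:

import unittest

import torch


def _supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv lowering needs SM 8.9+ (e.g. L4, H100);
    # the A10G on this runner reports (8, 6) and trips the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv requires SM 8.9+; this GPU does not support it")
class ActivationFp8Tests(unittest.TestCase):
    ...  # fp8-quantizing tests such as test_silu_mul_quant would live here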
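For triage, the failure should also reproduce without Hypothesis or torch.compile, since the compiled=False examples fail the same way. A minimal repro sketch, using only the import path and call signature visible in the traceback (the shapes are one of the sampled combinations):

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Build a [T, 2*D] bf16 input and split it, as the test does.
T, D = 128, 5120
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

# On a pre-SM-8.9 GPU this raises the same triton CompilationError
# ("type fp8e4nv not supported in this architecture") at compile time.
y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)  # scale_ub=None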
[Hypothesis then retried the remaining sampled parameter combinations; every one failed at the same kernel-compile step with the identical CompilationError. The repeated test-source listings and tracebacks are elided; the attempted examples and the final error tail follow.]

2025-05-07T20:32:35.3489800Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.3502669Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.3514982Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.3527887Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.3540151Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:35.3552485Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:35.3565663Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:35.3581689Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.3594492Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:35.3607175Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:35.3619629Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:35.3631374Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.3631472Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.3631548Z E   ^
2025-05-07T20:32:35.3631892Z E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3631935Z 2025-05-07T20:32:35.3632341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3632346Z 2025-05-07T20:32:35.3632446Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3632667Z self=, 2025-05-07T20:32:35.3632740Z T=16384, 2025-05-07T20:32:35.3632814Z D=5120, 2025-05-07T20:32:35.3632895Z scale_ub=None, 2025-05-07T20:32:35.3632983Z contiguous=False, 2025-05-07T20:32:35.3633063Z compiled=True, 2025-05-07T20:32:35.3633133Z ) 2025-05-07T20:32:35.3633346Z self = 2025-05-07T20:32:35.3633516Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3633525Z 2025-05-07T20:32:35.3633600Z @given( 2025-05-07T20:32:35.3633715Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3633817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3633928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3634044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3634156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3634224Z ) 2025-05-07T20:32:35.3634465Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3634560Z def test_silu_mul_quant( 2025-05-07T20:32:35.3634634Z self, 2025-05-07T20:32:35.3634711Z T: int, 2025-05-07T20:32:35.3634788Z D: int, 2025-05-07T20:32:35.3634883Z scale_ub: Optional[float], 2025-05-07T20:32:35.3634973Z contiguous: bool, 2025-05-07T20:32:35.3635056Z compiled: bool, 2025-05-07T20:32:35.3635132Z ) -> None: 2025-05-07T20:32:35.3635225Z torch.manual_seed(2025) 2025-05-07T20:32:35.3635293Z 2025-05-07T20:32:35.3635457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3635536Z 2025-05-07T20:32:35.3635625Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3635745Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3635832Z x = x_sign * x_clamp 2025-05-07T20:32:35.3635910Z x0 = x[:, :D] 2025-05-07T20:32:35.3635985Z x1 = x[:, D:] 2025-05-07T20:32:35.3636060Z 2025-05-07T20:32:35.3636141Z if contiguous: 2025-05-07T20:32:35.3636233Z x0 = x0.contiguous() 2025-05-07T20:32:35.3636325Z x1 = x1.contiguous() 2025-05-07T20:32:35.3636392Z 2025-05-07T20:32:35.3636480Z if scale_ub is not None: 2025-05-07T20:32:35.3636581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3636710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3636785Z ) 2025-05-07T20:32:35.3636859Z else: 2025-05-07T20:32:35.3636948Z scale_ub_tensor = None 2025-05-07T20:32:35.3637021Z 2025-05-07T20:32:35.3637146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3637231Z op = silu_mul_quant 2025-05-07T20:32:35.3637317Z if compiled: 2025-05-07T20:32:35.3637412Z op = torch.compile(op) 2025-05-07T20:32:35.3637515Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3637583Z 2025-05-07T20:32:35.3637721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3637726Z 2025-05-07T20:32:35.3637824Z moe/activation_test.py:117: 2025-05-07T20:32:35.3638024Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3638123Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3638222Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3638583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3638672Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3639196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3639290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3639640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3639859Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3640197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3640294Z kernel = self.compile( 2025-05-07T20:32:35.3640667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3640837Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3640962Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3640970Z 2025-05-07T20:32:35.3641170Z self = 2025-05-07T20:32:35.3641929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3642430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c68b80>} 2025-05-07T20:32:35.3643161Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3643347Z context = 2025-05-07T20:32:35.3643351Z 2025-05-07T20:32:35.3643510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3643770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3643874Z module_map=module_map) 2025-05-07T20:32:35.3644034Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3644129Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3644204Z E ^ 2025-05-07T20:32:35.3644562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3644566Z 2025-05-07T20:32:35.3644969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3644973Z 2025-05-07T20:32:35.3645076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3645294Z self=, 2025-05-07T20:32:35.3645365Z T=2048, 2025-05-07T20:32:35.3645447Z D=5120, 2025-05-07T20:32:35.3645525Z scale_ub=None, 2025-05-07T20:32:35.3645612Z contiguous=False, 2025-05-07T20:32:35.3645693Z compiled=True, 2025-05-07T20:32:35.3645760Z ) 2025-05-07T20:32:35.3645972Z self = 2025-05-07T20:32:35.3646144Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3646197Z 2025-05-07T20:32:35.3646272Z @given( 2025-05-07T20:32:35.3646392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3646563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3646676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3646793Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3646902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3646973Z ) 2025-05-07T20:32:35.3647214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3647343Z def test_silu_mul_quant( 2025-05-07T20:32:35.3647417Z self, 2025-05-07T20:32:35.3647493Z T: int, 2025-05-07T20:32:35.3647566Z D: int, 2025-05-07T20:32:35.3647663Z scale_ub: Optional[float], 2025-05-07T20:32:35.3647754Z contiguous: bool, 2025-05-07T20:32:35.3647836Z compiled: bool, 2025-05-07T20:32:35.3647914Z ) -> None: 2025-05-07T20:32:35.3648009Z torch.manual_seed(2025) 2025-05-07T20:32:35.3648082Z 2025-05-07T20:32:35.3648254Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3648327Z 2025-05-07T20:32:35.3648416Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3648541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3648628Z x = x_sign * x_clamp 2025-05-07T20:32:35.3648705Z x0 = x[:, :D] 2025-05-07T20:32:35.3648785Z x1 = x[:, D:] 2025-05-07T20:32:35.3648857Z 2025-05-07T20:32:35.3648940Z if contiguous: 2025-05-07T20:32:35.3649034Z x0 = x0.contiguous() 2025-05-07T20:32:35.3649119Z x1 = x1.contiguous() 2025-05-07T20:32:35.3649195Z 2025-05-07T20:32:35.3649282Z if scale_ub is not None: 2025-05-07T20:32:35.3649385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3649518Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3649593Z ) 2025-05-07T20:32:35.3649666Z else: 2025-05-07T20:32:35.3649759Z scale_ub_tensor = None 2025-05-07T20:32:35.3649828Z 2025-05-07T20:32:35.3649957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3650048Z op = silu_mul_quant 2025-05-07T20:32:35.3650129Z if compiled: 2025-05-07T20:32:35.3650227Z op = torch.compile(op) 2025-05-07T20:32:35.3650331Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3650401Z 2025-05-07T20:32:35.3650494Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3650502Z 2025-05-07T20:32:35.3650596Z moe/activation_test.py:117: 2025-05-07T20:32:35.3650721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3650821Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3650915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3651274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3651368Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3651852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3651950Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3652299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3652516Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3652851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3652941Z kernel = self.compile( 2025-05-07T20:32:35.3653363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3653536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3653707Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3653712Z 2025-05-07T20:32:35.3653988Z self = 2025-05-07T20:32:35.3654745Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3655239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c6a0c0>} 2025-05-07T20:32:35.3656036Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3656220Z context = 2025-05-07T20:32:35.3656227Z 2025-05-07T20:32:35.3656393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3656656Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3656761Z module_map=module_map) 2025-05-07T20:32:35.3656917Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3657013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3657090Z E ^ 2025-05-07T20:32:35.3657437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3657445Z 2025-05-07T20:32:35.3657848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3657853Z 2025-05-07T20:32:35.3657959Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3658179Z self=, 2025-05-07T20:32:35.3658257Z T=2048, 2025-05-07T20:32:35.3658333Z D=5120, 2025-05-07T20:32:35.3658418Z scale_ub=1200.0, 2025-05-07T20:32:35.3658506Z contiguous=False, 2025-05-07T20:32:35.3658590Z compiled=True, 2025-05-07T20:32:35.3658658Z ) 2025-05-07T20:32:35.3658873Z self = 2025-05-07T20:32:35.3659042Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3659046Z 2025-05-07T20:32:35.3659122Z @given( 2025-05-07T20:32:35.3659568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3659711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3659869Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3659985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3660096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3660178Z ) 2025-05-07T20:32:35.3660419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3660517Z def test_silu_mul_quant( 2025-05-07T20:32:35.3660593Z self, 2025-05-07T20:32:35.3660669Z T: int, 2025-05-07T20:32:35.3660747Z D: int, 2025-05-07T20:32:35.3660846Z scale_ub: Optional[float], 2025-05-07T20:32:35.3660930Z contiguous: bool, 2025-05-07T20:32:35.3661010Z compiled: bool, 2025-05-07T20:32:35.3661090Z ) -> None: 2025-05-07T20:32:35.3661182Z torch.manual_seed(2025) 2025-05-07T20:32:35.3661259Z 2025-05-07T20:32:35.3661421Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3661489Z 2025-05-07T20:32:35.3661582Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3661707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3661792Z x = x_sign * x_clamp 2025-05-07T20:32:35.3661870Z x0 = x[:, :D] 2025-05-07T20:32:35.3662040Z x1 = x[:, D:] 2025-05-07T20:32:35.3662113Z 2025-05-07T20:32:35.3662197Z if contiguous: 2025-05-07T20:32:35.3662390Z x0 = x0.contiguous() 2025-05-07T20:32:35.3662479Z x1 = x1.contiguous() 2025-05-07T20:32:35.3662553Z 2025-05-07T20:32:35.3662641Z if scale_ub is not None: 2025-05-07T20:32:35.3662745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3662876Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3662948Z ) 2025-05-07T20:32:35.3663021Z else: 2025-05-07T20:32:35.3663173Z scale_ub_tensor = None 2025-05-07T20:32:35.3663241Z 2025-05-07T20:32:35.3663369Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3663456Z op = silu_mul_quant 2025-05-07T20:32:35.3663536Z if compiled: 2025-05-07T20:32:35.3663636Z op = torch.compile(op) 2025-05-07T20:32:35.3663737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3663809Z 2025-05-07T20:32:35.3663901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3663905Z 2025-05-07T20:32:35.3664002Z moe/activation_test.py:117: 2025-05-07T20:32:35.3664132Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3664230Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3664326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3664688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3664781Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3665265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3665366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3665713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3665940Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3666275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3666363Z kernel = self.compile( 2025-05-07T20:32:35.3666736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3666905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3667029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3667040Z 2025-05-07T20:32:35.3667240Z self = 2025-05-07T20:32:35.3667993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3668495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb017c6b2e0>} 2025-05-07T20:32:35.3669225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3669415Z context = 2025-05-07T20:32:35.3669422Z 2025-05-07T20:32:35.3669582Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3669836Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3669943Z module_map=module_map) 2025-05-07T20:32:35.3670101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3670247Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3670319Z E ^ 2025-05-07T20:32:35.3670734Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3670739Z 2025-05-07T20:32:35.3671146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3671150Z 2025-05-07T20:32:35.3671249Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3671465Z self=, 2025-05-07T20:32:35.3671580Z T=4096, 2025-05-07T20:32:35.3671654Z D=5120, 2025-05-07T20:32:35.3671738Z scale_ub=1200.0, 2025-05-07T20:32:35.3671819Z contiguous=True, 2025-05-07T20:32:35.3671895Z compiled=True, 2025-05-07T20:32:35.3671969Z ) 2025-05-07T20:32:35.3672182Z self = 2025-05-07T20:32:35.3672348Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3672353Z 2025-05-07T20:32:35.3672432Z @given( 2025-05-07T20:32:35.3672552Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3672647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3672762Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3672877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3672988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3673058Z ) 2025-05-07T20:32:35.3673296Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3673398Z def test_silu_mul_quant( 2025-05-07T20:32:35.3673469Z self, 2025-05-07T20:32:35.3673545Z T: int, 2025-05-07T20:32:35.3673622Z D: int, 2025-05-07T20:32:35.3673719Z scale_ub: Optional[float], 2025-05-07T20:32:35.3673805Z contiguous: bool, 2025-05-07T20:32:35.3673891Z compiled: bool, 2025-05-07T20:32:35.3673966Z ) -> None: 2025-05-07T20:32:35.3674056Z torch.manual_seed(2025) 2025-05-07T20:32:35.3674129Z 2025-05-07T20:32:35.3674297Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3674375Z 2025-05-07T20:32:35.3674463Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3674586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3674676Z x = x_sign * x_clamp 2025-05-07T20:32:35.3674752Z x0 = x[:, :D] 2025-05-07T20:32:35.3674827Z x1 = x[:, D:] 2025-05-07T20:32:35.3674904Z 2025-05-07T20:32:35.3674984Z if contiguous: 2025-05-07T20:32:35.3675069Z x0 = x0.contiguous() 2025-05-07T20:32:35.3675157Z x1 = x1.contiguous() 2025-05-07T20:32:35.3675229Z 2025-05-07T20:32:35.3675317Z if scale_ub is not None: 2025-05-07T20:32:35.3675427Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3675558Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3675633Z ) 2025-05-07T20:32:35.3675709Z else: 2025-05-07T20:32:35.3675805Z scale_ub_tensor = None 2025-05-07T20:32:35.3675879Z 2025-05-07T20:32:35.3676005Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3676092Z op = silu_mul_quant 2025-05-07T20:32:35.3676179Z if compiled: 2025-05-07T20:32:35.3676278Z op = torch.compile(op) 2025-05-07T20:32:35.3676380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3676456Z 2025-05-07T20:32:35.3676543Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3676548Z 2025-05-07T20:32:35.3676640Z moe/activation_test.py:117: 2025-05-07T20:32:35.3676767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3676864Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3676963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3677372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3677576Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3678062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3678156Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3678507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3678729Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3679101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3679194Z kernel = self.compile( 2025-05-07T20:32:35.3679565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3679736Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3679868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3679872Z 2025-05-07T20:32:35.3680071Z self = 2025-05-07T20:32:35.3680829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3681323Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172fc860>} 2025-05-07T20:32:35.3682050Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3682243Z context = 2025-05-07T20:32:35.3682247Z 2025-05-07T20:32:35.3682411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3682668Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3682769Z module_map=module_map) 2025-05-07T20:32:35.3682926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3683024Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3683100Z E ^ 2025-05-07T20:32:35.3683447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3683452Z 2025-05-07T20:32:35.3683858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3683862Z 2025-05-07T20:32:35.3683964Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3684185Z self=, 2025-05-07T20:32:35.3684264Z T=128, 2025-05-07T20:32:35.3684339Z D=5120, 2025-05-07T20:32:35.3684423Z scale_ub=1200.0, 2025-05-07T20:32:35.3684508Z contiguous=False, 2025-05-07T20:32:35.3684593Z compiled=True, 2025-05-07T20:32:35.3684662Z ) 2025-05-07T20:32:35.3684873Z self = 2025-05-07T20:32:35.3685042Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3685050Z 2025-05-07T20:32:35.3685123Z @given( 2025-05-07T20:32:35.3685237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3685334Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3685444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3685556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3685718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3685791Z ) 2025-05-07T20:32:35.3686131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3686225Z def test_silu_mul_quant( 2025-05-07T20:32:35.3686298Z self, 2025-05-07T20:32:35.3686376Z T: int, 2025-05-07T20:32:35.3686449Z D: int, 2025-05-07T20:32:35.3686546Z scale_ub: Optional[float], 2025-05-07T20:32:35.3686635Z contiguous: bool, 2025-05-07T20:32:35.3690385Z compiled: bool, 2025-05-07T20:32:35.3690472Z ) -> None: 2025-05-07T20:32:35.3690638Z torch.manual_seed(2025) 2025-05-07T20:32:35.3690707Z 2025-05-07T20:32:35.3690876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3690951Z 2025-05-07T20:32:35.3691038Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3691160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3691252Z x = x_sign * x_clamp 2025-05-07T20:32:35.3691331Z x0 = x[:, :D] 2025-05-07T20:32:35.3691411Z x1 = x[:, D:] 2025-05-07T20:32:35.3691482Z 2025-05-07T20:32:35.3691568Z if contiguous: 2025-05-07T20:32:35.3691662Z x0 = x0.contiguous() 2025-05-07T20:32:35.3691748Z x1 = x1.contiguous() 2025-05-07T20:32:35.3691815Z 2025-05-07T20:32:35.3691905Z if scale_ub is not None: 2025-05-07T20:32:35.3692008Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3692138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3692217Z ) 2025-05-07T20:32:35.3692290Z else: 2025-05-07T20:32:35.3692381Z scale_ub_tensor = None 2025-05-07T20:32:35.3692453Z 2025-05-07T20:32:35.3692579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3692669Z op = silu_mul_quant 2025-05-07T20:32:35.3692752Z if compiled: 2025-05-07T20:32:35.3692850Z op = torch.compile(op) 2025-05-07T20:32:35.3692956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3693099Z 2025-05-07T20:32:35.3693190Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3693195Z 2025-05-07T20:32:35.3693291Z moe/activation_test.py:117: 2025-05-07T20:32:35.3693417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3693515Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3693614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3693974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3694071Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3694552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3694644Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3694995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3695219Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3695575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3695685Z kernel = self.compile( 2025-05-07T20:32:35.3696066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3696237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3696363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3696367Z 2025-05-07T20:32:35.3696567Z self = 2025-05-07T20:32:35.3697330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3697956Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172fd580>} 2025-05-07T20:32:35.3698692Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3698877Z context = 2025-05-07T20:32:35.3698918Z 2025-05-07T20:32:35.3699083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3699338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3699442Z module_map=module_map) 2025-05-07T20:32:35.3699604Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3699699Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3699774Z E ^ 2025-05-07T20:32:35.3700127Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3700132Z 2025-05-07T20:32:35.3700536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3700541Z 2025-05-07T20:32:35.3700640Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3700858Z self=, 2025-05-07T20:32:35.3700934Z T=16384, 2025-05-07T20:32:35.3701013Z D=7168, 2025-05-07T20:32:35.3701094Z scale_ub=1200.0, 2025-05-07T20:32:35.3701175Z contiguous=True, 2025-05-07T20:32:35.3701256Z compiled=True, 2025-05-07T20:32:35.3701325Z ) 2025-05-07T20:32:35.3701536Z self = 2025-05-07T20:32:35.3701715Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3701725Z 2025-05-07T20:32:35.3701802Z @given( 2025-05-07T20:32:35.3701920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3702014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3702125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3702241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3702350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3702426Z ) 2025-05-07T20:32:35.3702666Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3702756Z def test_silu_mul_quant( 2025-05-07T20:32:35.3702834Z self, 2025-05-07T20:32:35.3702908Z T: int, 2025-05-07T20:32:35.3702980Z D: int, 2025-05-07T20:32:35.3703077Z scale_ub: Optional[float], 2025-05-07T20:32:35.3703167Z contiguous: bool, 2025-05-07T20:32:35.3703250Z compiled: bool, 2025-05-07T20:32:35.3703327Z ) -> None: 2025-05-07T20:32:35.3703423Z torch.manual_seed(2025) 2025-05-07T20:32:35.3703493Z 2025-05-07T20:32:35.3703661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3703731Z 2025-05-07T20:32:35.3703820Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3703946Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3704032Z x = x_sign * x_clamp 2025-05-07T20:32:35.3704112Z x0 = x[:, :D] 2025-05-07T20:32:35.3704192Z x1 = x[:, D:] 2025-05-07T20:32:35.3704261Z 2025-05-07T20:32:35.3704346Z if contiguous: 2025-05-07T20:32:35.3704440Z x0 = x0.contiguous() 2025-05-07T20:32:35.3704526Z x1 = x1.contiguous() 2025-05-07T20:32:35.3704600Z 2025-05-07T20:32:35.3704688Z if scale_ub is not None: 2025-05-07T20:32:35.3704789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3704972Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3705042Z ) 2025-05-07T20:32:35.3705185Z else: 2025-05-07T20:32:35.3705280Z scale_ub_tensor = None 2025-05-07T20:32:35.3705349Z 2025-05-07T20:32:35.3705474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3705566Z op = silu_mul_quant 2025-05-07T20:32:35.3705646Z if compiled: 2025-05-07T20:32:35.3705748Z op = torch.compile(op) 2025-05-07T20:32:35.3705889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3705959Z 2025-05-07T20:32:35.3706051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3706056Z 2025-05-07T20:32:35.3706148Z moe/activation_test.py:117: 2025-05-07T20:32:35.3706271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3706369Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3706469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3706832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3706924Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3707406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3707503Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3707850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3708071Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3708403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3708493Z kernel = self.compile( 2025-05-07T20:32:35.3708868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3709041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3709168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3709172Z 2025-05-07T20:32:35.3709374Z self = 2025-05-07T20:32:35.3710131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3710628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172fe0c0>} 2025-05-07T20:32:35.3711354Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3711545Z context = 2025-05-07T20:32:35.3711549Z 2025-05-07T20:32:35.3711714Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3711969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3712073Z module_map=module_map) 2025-05-07T20:32:35.3712230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3712330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3712407Z E ^ 2025-05-07T20:32:35.3712753Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3712758Z 2025-05-07T20:32:35.3713164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3713214Z 2025-05-07T20:32:35.3713312Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3713599Z self=, 2025-05-07T20:32:35.3713677Z T=16384, 2025-05-07T20:32:35.3713751Z D=5120, 2025-05-07T20:32:35.3713831Z scale_ub=1200.0, 2025-05-07T20:32:35.3713916Z contiguous=True, 2025-05-07T20:32:35.3713996Z compiled=False, 2025-05-07T20:32:35.3714069Z ) 2025-05-07T20:32:35.3714283Z self = 2025-05-07T20:32:35.3714495Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3714499Z 2025-05-07T20:32:35.3714576Z @given( 2025-05-07T20:32:35.3714693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3714791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3714905Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3715021Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3715131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3715206Z ) 2025-05-07T20:32:35.3715444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3715533Z def test_silu_mul_quant( 2025-05-07T20:32:35.3715625Z self, 2025-05-07T20:32:35.3715705Z T: int, 2025-05-07T20:32:35.3715799Z D: int, 2025-05-07T20:32:35.3715907Z scale_ub: Optional[float], 2025-05-07T20:32:35.3715993Z contiguous: bool, 2025-05-07T20:32:35.3716082Z compiled: bool, 2025-05-07T20:32:35.3716155Z ) -> None: 2025-05-07T20:32:35.3716246Z torch.manual_seed(2025) 2025-05-07T20:32:35.3716320Z 2025-05-07T20:32:35.3716483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3716554Z 2025-05-07T20:32:35.3716644Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3716765Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3716854Z x = x_sign * x_clamp 2025-05-07T20:32:35.3716933Z x0 = x[:, :D] 2025-05-07T20:32:35.3717016Z x1 = x[:, D:] 2025-05-07T20:32:35.3717086Z 2025-05-07T20:32:35.3717170Z if contiguous: 2025-05-07T20:32:35.3717257Z x0 = x0.contiguous() 2025-05-07T20:32:35.3717345Z x1 = x1.contiguous() 2025-05-07T20:32:35.3717417Z 2025-05-07T20:32:35.3717505Z if scale_ub is not None: 2025-05-07T20:32:35.3717610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3717745Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3717818Z ) 2025-05-07T20:32:35.3717892Z else: 2025-05-07T20:32:35.3717983Z scale_ub_tensor = None 2025-05-07T20:32:35.3718055Z 2025-05-07T20:32:35.3718185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3718273Z op = silu_mul_quant 2025-05-07T20:32:35.3718357Z if compiled: 2025-05-07T20:32:35.3718455Z op = torch.compile(op) 2025-05-07T20:32:35.3718560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3718633Z 2025-05-07T20:32:35.3718721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3718727Z 2025-05-07T20:32:35.3718819Z moe/activation_test.py:117: 2025-05-07T20:32:35.3718947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3719043Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3719140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3719633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.3719726Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3720080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3720296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3720771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3720865Z kernel = self.compile( 2025-05-07T20:32:35.3721238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3721407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3721534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3721577Z 2025-05-07T20:32:35.3721777Z self = 2025-05-07T20:32:35.3722538Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3725537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb0172ff1a0>} 2025-05-07T20:32:35.3726291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3726485Z context = 2025-05-07T20:32:35.3726491Z 2025-05-07T20:32:35.3726651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3726917Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3727023Z module_map=module_map) 2025-05-07T20:32:35.3727182Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3727287Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3727367Z E ^ 2025-05-07T20:32:35.3727723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3727750Z 2025-05-07T20:32:35.3728158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3728163Z 2025-05-07T20:32:35.3728268Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3728491Z self=, 2025-05-07T20:32:35.3728566Z T=1, 2025-05-07T20:32:35.3728643Z D=7168, 2025-05-07T20:32:35.3728729Z scale_ub=1200.0, 2025-05-07T20:32:35.3728812Z contiguous=False, 2025-05-07T20:32:35.3728892Z compiled=False, 2025-05-07T20:32:35.3728966Z ) 2025-05-07T20:32:35.3729177Z self = 2025-05-07T20:32:35.3729341Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3729349Z 2025-05-07T20:32:35.3729429Z @given( 2025-05-07T20:32:35.3729546Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3729653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3729765Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3729882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3729999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3730072Z ) 2025-05-07T20:32:35.3730312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3730410Z def test_silu_mul_quant( 2025-05-07T20:32:35.3730483Z self, 2025-05-07T20:32:35.3730557Z T: int, 2025-05-07T20:32:35.3730633Z D: int, 2025-05-07T20:32:35.3730729Z scale_ub: Optional[float], 2025-05-07T20:32:35.3730814Z contiguous: bool, 2025-05-07T20:32:35.3730898Z compiled: bool, 2025-05-07T20:32:35.3731032Z ) -> None: 2025-05-07T20:32:35.3731129Z torch.manual_seed(2025) 2025-05-07T20:32:35.3731202Z 2025-05-07T20:32:35.3731408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3731485Z 2025-05-07T20:32:35.3731574Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3731698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3731788Z x = x_sign * x_clamp 2025-05-07T20:32:35.3731866Z x0 = x[:, :D] 2025-05-07T20:32:35.3731945Z x1 = x[:, D:] 2025-05-07T20:32:35.3732020Z 2025-05-07T20:32:35.3732141Z if contiguous: 2025-05-07T20:32:35.3732230Z x0 = x0.contiguous() 2025-05-07T20:32:35.3732319Z x1 = x1.contiguous() 2025-05-07T20:32:35.3732390Z 2025-05-07T20:32:35.3732479Z if scale_ub is not None: 2025-05-07T20:32:35.3732581Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3732710Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3732791Z ) 2025-05-07T20:32:35.3732869Z else: 2025-05-07T20:32:35.3732959Z scale_ub_tensor = None 2025-05-07T20:32:35.3733097Z 2025-05-07T20:32:35.3733307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3733396Z op = silu_mul_quant 2025-05-07T20:32:35.3733479Z if compiled: 2025-05-07T20:32:35.3733576Z op = torch.compile(op) 2025-05-07T20:32:35.3733677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3733752Z 2025-05-07T20:32:35.3733840Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3733848Z 2025-05-07T20:32:35.3733948Z moe/activation_test.py:117: 2025-05-07T20:32:35.3734074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3734170Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3734268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3734758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3734855Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3735215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3735433Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3735767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3735856Z kernel = self.compile( 2025-05-07T20:32:35.3736231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3736406Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3736530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3736534Z 2025-05-07T20:32:35.3736740Z self = 2025-05-07T20:32:35.3737506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3737999Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016dbc680>} 2025-05-07T20:32:35.3738734Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3738922Z context = 2025-05-07T20:32:35.3738926Z 2025-05-07T20:32:35.3739090Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3739392Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3739534Z module_map=module_map) 2025-05-07T20:32:35.3739703Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3739797Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3739870Z E ^ 2025-05-07T20:32:35.3740220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3740225Z 2025-05-07T20:32:35.3740628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3740671Z 2025-05-07T20:32:35.3740774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3740989Z self=, 2025-05-07T20:32:35.3741061Z T=4096, 2025-05-07T20:32:35.3741141Z D=7168, 2025-05-07T20:32:35.3741226Z scale_ub=1200.0, 2025-05-07T20:32:35.3741316Z contiguous=False, 2025-05-07T20:32:35.3741396Z compiled=True, 2025-05-07T20:32:35.3741466Z ) 2025-05-07T20:32:35.3741738Z self = 2025-05-07T20:32:35.3741910Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.3741915Z 2025-05-07T20:32:35.3741986Z @given( 2025-05-07T20:32:35.3742107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3742205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3742320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3742435Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3742546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3742621Z ) 2025-05-07T20:32:35.3742859Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3742950Z def test_silu_mul_quant( 2025-05-07T20:32:35.3743030Z self, 2025-05-07T20:32:35.3743103Z T: int, 2025-05-07T20:32:35.3743176Z D: int, 2025-05-07T20:32:35.3743279Z scale_ub: Optional[float], 2025-05-07T20:32:35.3743367Z contiguous: bool, 2025-05-07T20:32:35.3743449Z compiled: bool, 2025-05-07T20:32:35.3743526Z ) -> None: 2025-05-07T20:32:35.3743618Z torch.manual_seed(2025) 2025-05-07T20:32:35.3743688Z 2025-05-07T20:32:35.3743854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3743923Z 2025-05-07T20:32:35.3744017Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3744145Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3744232Z x = x_sign * x_clamp 2025-05-07T20:32:35.3744313Z x0 = x[:, :D] 2025-05-07T20:32:35.3744390Z x1 = x[:, D:] 2025-05-07T20:32:35.3744461Z 2025-05-07T20:32:35.3744542Z if contiguous: 2025-05-07T20:32:35.3744630Z x0 = x0.contiguous() 2025-05-07T20:32:35.3744719Z x1 = x1.contiguous() 2025-05-07T20:32:35.3744795Z 2025-05-07T20:32:35.3744886Z if scale_ub is not None: 2025-05-07T20:32:35.3744990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3745123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3745194Z ) 2025-05-07T20:32:35.3745267Z else: 2025-05-07T20:32:35.3745365Z scale_ub_tensor = None 2025-05-07T20:32:35.3745454Z 2025-05-07T20:32:35.3745601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3745704Z op = silu_mul_quant 2025-05-07T20:32:35.3745788Z if compiled: 2025-05-07T20:32:35.3745888Z op = torch.compile(op) 2025-05-07T20:32:35.3745990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3746064Z 2025-05-07T20:32:35.3746158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3746163Z 2025-05-07T20:32:35.3746304Z moe/activation_test.py:117: 2025-05-07T20:32:35.3746431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3746574Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3746672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3747033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3747123Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(... same Triton compile frames as above ...)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
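Note on the repeated CompilationError: fp8e4nv is Triton's name for the float8_e4m3fn format, which its NVIDIA backend only compiles for GPUs of compute capability 8.9 or newer; the GPU in this job evidently reports an older capability, so every example that reaches the kernel fails identically regardless of its parameters. A minimal guard is sketched below for illustration only; the skip helper and the class name are hypothetical, not code from activation_test.py:

    import unittest

    import torch


    def _supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs compute capability
        # >= 8.9 (Ada/Hopper); older GPUs raise the ValueError shown above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):  # hypothetical class name
        ...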
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError (fp8e4nv not supported in this architecture)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp, tried to allocate 112.00 MiB)
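The OOM report is explicit about where the memory sits: of the 22.07 GiB total, more than 21 GiB is already held by PyTorch when a request of a few hundred MiB fails, and the message's own suggestion is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A sketch of applying that setting, assuming it runs before the process touches CUDA (the variable is read once, when the caching allocator initializes):

    import os

    # Must be set before the first CUDA allocation in the process;
    # the caching allocator reads it once, at initialization.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # safe to import afterwards; the variable is consulted at first CUDA use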
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 448.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp, tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign, tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
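These parameter-independent repeats confirm the error is raised while Triton lowers the kernel's AST (src.make_ir), before any tensor data or launch grid matters. A stand-alone repro sketch under the same assumption (the kernel below is hypothetical, not the FBGEMM one): casting to tl.float8e4nv in any jitted kernel should trip the identical ValueError on an unsupported GPU.

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The .to(tl.float8e4nv) cast is what make_ir rejects on pre-8.9 GPUs.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)  # CompilationError on SM < 8.9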
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 56.00 MiB)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError (fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign, tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 320.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 80.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 40.00 MiB)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 112.00 MiB)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn, tried to allocate 448.00 MiB)
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3907308Z 2025-05-07T20:32:35.3907420Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3907424Z 2025-05-07T20:32:35.3907523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3907742Z self=, 2025-05-07T20:32:35.3907814Z T=4096, 2025-05-07T20:32:35.3907890Z D=7168, 2025-05-07T20:32:35.3907975Z scale_ub=None, 2025-05-07T20:32:35.3908054Z contiguous=True, 2025-05-07T20:32:35.3908137Z compiled=False, 2025-05-07T20:32:35.3908210Z ) 2025-05-07T20:32:35.3908419Z self = 2025-05-07T20:32:35.3908586Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3908591Z 2025-05-07T20:32:35.3908667Z @given( 2025-05-07T20:32:35.3908784Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3908879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3908989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3909104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3909213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3909285Z ) 2025-05-07T20:32:35.3909526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3909618Z def test_silu_mul_quant( 2025-05-07T20:32:35.3909697Z self, 2025-05-07T20:32:35.3909774Z T: int, 2025-05-07T20:32:35.3909850Z D: int, 2025-05-07T20:32:35.3909948Z scale_ub: Optional[float], 2025-05-07T20:32:35.3910034Z contiguous: bool, 2025-05-07T20:32:35.3910116Z compiled: bool, 2025-05-07T20:32:35.3910198Z ) -> None: 2025-05-07T20:32:35.3910289Z torch.manual_seed(2025) 2025-05-07T20:32:35.3910364Z 2025-05-07T20:32:35.3910528Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3912305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3912353Z 2025-05-07T20:32:35.3912471Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3912475Z 2025-05-07T20:32:35.3912572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3912790Z self=, 2025-05-07T20:32:35.3912902Z T=16384, 2025-05-07T20:32:35.3912977Z D=7168, 2025-05-07T20:32:35.3913063Z scale_ub=None, 2025-05-07T20:32:35.3913145Z contiguous=True, 2025-05-07T20:32:35.3913229Z compiled=False, 2025-05-07T20:32:35.3913305Z ) 2025-05-07T20:32:35.3913514Z self = 2025-05-07T20:32:35.3913687Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:35.3913691Z 2025-05-07T20:32:35.3913768Z @given( 2025-05-07T20:32:35.3913922Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3914019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3914134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3914247Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3914360Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3914432Z ) 2025-05-07T20:32:35.3914676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3914774Z def test_silu_mul_quant( 2025-05-07T20:32:35.3914847Z self, 2025-05-07T20:32:35.3914922Z T: int, 2025-05-07T20:32:35.3914999Z D: int, 2025-05-07T20:32:35.3915095Z scale_ub: Optional[float], 2025-05-07T20:32:35.3915179Z contiguous: bool, 2025-05-07T20:32:35.3915268Z compiled: bool, 2025-05-07T20:32:35.3915345Z ) -> None: 2025-05-07T20:32:35.3915436Z torch.manual_seed(2025) 2025-05-07T20:32:35.3915513Z 2025-05-07T20:32:35.3915678Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3917423Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
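The "Tried to allocate" sizes in these failures are exactly the test's first tensor: randn([T, 2 * D]) in bfloat16 costs T * 2D * 2 bytes. A quick check against the example sizes from the log above:

T, D = 16384, 7168
print(T * (2 * D) * 2 / 2**20)  # 448.0 -> "Tried to allocate 448.00 MiB"
# Likewise (4096, 7168) -> 112.0 MiB and (2048, 5120) -> 40.0 MiB, matching
# the earlier OutOfMemoryError messages in this run.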
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3917431Z 2025-05-07T20:32:35.3917545Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3917550Z 2025-05-07T20:32:35.3917657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3917871Z self=, 2025-05-07T20:32:35.3917948Z T=16384, 2025-05-07T20:32:35.3918025Z D=7168, 2025-05-07T20:32:35.3918104Z scale_ub=1200.0, 2025-05-07T20:32:35.3918185Z contiguous=True, 2025-05-07T20:32:35.3918265Z compiled=False, 2025-05-07T20:32:35.3918337Z ) 2025-05-07T20:32:35.3918546Z self = 2025-05-07T20:32:35.3918720Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3918726Z 2025-05-07T20:32:35.3918801Z @given( 2025-05-07T20:32:35.3918917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3919015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3919125Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3919238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3919393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3919466Z ) 2025-05-07T20:32:35.3919750Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3919842Z def test_silu_mul_quant( 2025-05-07T20:32:35.3919918Z self, 2025-05-07T20:32:35.3919993Z T: int, 2025-05-07T20:32:35.3920066Z D: int, 2025-05-07T20:32:35.3920165Z scale_ub: Optional[float], 2025-05-07T20:32:35.3920253Z contiguous: bool, 2025-05-07T20:32:35.3920335Z compiled: bool, 2025-05-07T20:32:35.3920457Z ) -> None: 2025-05-07T20:32:35.3920546Z torch.manual_seed(2025) 2025-05-07T20:32:35.3920620Z 2025-05-07T20:32:35.3920786Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3922565Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3922574Z 2025-05-07T20:32:35.3922690Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3922695Z 2025-05-07T20:32:35.3922794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3923018Z self=, 2025-05-07T20:32:35.3923093Z T=128, 2025-05-07T20:32:35.3923166Z D=5120, 2025-05-07T20:32:35.3923249Z scale_ub=1200.0, 2025-05-07T20:32:35.3923335Z contiguous=False, 2025-05-07T20:32:35.3923418Z compiled=False, 2025-05-07T20:32:35.3923488Z ) 2025-05-07T20:32:35.3923697Z self = 2025-05-07T20:32:35.3923865Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.3923874Z 2025-05-07T20:32:35.3923949Z @given( 2025-05-07T20:32:35.3924063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3924159Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3924272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3924383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3924496Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3924572Z ) 2025-05-07T20:32:35.3924810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3924904Z def test_silu_mul_quant( 2025-05-07T20:32:35.3924978Z self, 2025-05-07T20:32:35.3925052Z T: int, 2025-05-07T20:32:35.3925129Z D: int, 2025-05-07T20:32:35.3925225Z scale_ub: Optional[float], 2025-05-07T20:32:35.3925314Z contiguous: bool, 2025-05-07T20:32:35.3925398Z compiled: bool, 2025-05-07T20:32:35.3925473Z ) -> None: 2025-05-07T20:32:35.3925569Z torch.manual_seed(2025) 2025-05-07T20:32:35.3925644Z 2025-05-07T20:32:35.3925806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3925881Z 2025-05-07T20:32:35.3925971Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3926094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3926183Z x = x_sign * x_clamp 2025-05-07T20:32:35.3926264Z x0 = x[:, :D] 2025-05-07T20:32:35.3926340Z x1 = x[:, D:] 2025-05-07T20:32:35.3926409Z 2025-05-07T20:32:35.3926490Z if contiguous: 2025-05-07T20:32:35.3926578Z x0 = x0.contiguous() 2025-05-07T20:32:35.3926666Z x1 = x1.contiguous() 2025-05-07T20:32:35.3926737Z 2025-05-07T20:32:35.3926825Z if scale_ub is not None: 2025-05-07T20:32:35.3926978Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3927109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3927223Z ) 2025-05-07T20:32:35.3927301Z else: 2025-05-07T20:32:35.3927395Z scale_ub_tensor = None 2025-05-07T20:32:35.3927474Z 2025-05-07T20:32:35.3927601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3927689Z op = silu_mul_quant 2025-05-07T20:32:35.3927773Z if compiled: 2025-05-07T20:32:35.3927870Z op = torch.compile(op) 2025-05-07T20:32:35.3928012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3928082Z 2025-05-07T20:32:35.3928169Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3928174Z 2025-05-07T20:32:35.3928267Z moe/activation_test.py:117: 2025-05-07T20:32:35.3928396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3928495Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3928597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3929131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3929227Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3929582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3929799Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3930136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3930230Z kernel = self.compile( 2025-05-07T20:32:35.3930605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3930781Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3930908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3930912Z 2025-05-07T20:32:35.3931118Z self = 2025-05-07T20:32:35.3931882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3932377Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016850cc0>} 2025-05-07T20:32:35.3933163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3933349Z context = 2025-05-07T20:32:35.3933356Z 2025-05-07T20:32:35.3933521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3933782Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3933886Z module_map=module_map) 2025-05-07T20:32:35.3934048Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3934143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3934217Z E ^ 2025-05-07T20:32:35.3934569Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3934577Z 2025-05-07T20:32:35.3934981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3934985Z 2025-05-07T20:32:35.3935097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3939478Z self=, 2025-05-07T20:32:35.3939643Z T=2048, 2025-05-07T20:32:35.3939721Z D=7168, 2025-05-07T20:32:35.3939849Z scale_ub=None, 2025-05-07T20:32:35.3939943Z contiguous=False, 2025-05-07T20:32:35.3940023Z compiled=False, 2025-05-07T20:32:35.3940093Z ) 2025-05-07T20:32:35.3940314Z self = 2025-05-07T20:32:35.3940488Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.3940493Z 2025-05-07T20:32:35.3940567Z @given( 2025-05-07T20:32:35.3940728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3940832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3940941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3941055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3941167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3941241Z ) 2025-05-07T20:32:35.3941490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3941583Z def test_silu_mul_quant( 2025-05-07T20:32:35.3941698Z self, 2025-05-07T20:32:35.3941781Z T: int, 2025-05-07T20:32:35.3941856Z D: int, 2025-05-07T20:32:35.3941952Z scale_ub: Optional[float], 2025-05-07T20:32:35.3942044Z contiguous: bool, 2025-05-07T20:32:35.3942129Z compiled: bool, 2025-05-07T20:32:35.3942205Z ) -> None: 2025-05-07T20:32:35.3942300Z torch.manual_seed(2025) 2025-05-07T20:32:35.3942372Z 2025-05-07T20:32:35.3942543Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3944313Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
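The CompilationError above is an architecture limit rather than a test bug: Triton's fp8e4nv is the e4m3 float8 format, which only lowers on compute capability 8.9+ (Ada/Hopper); the dtype list in the error is what older GPUs support. A hedged skip-guard sketch (the capability check is standard PyTorch; the skipUnless wiring is an assumption, not FBGEMM's actual gating):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # e4m3 ("fp8e4nv" in Triton) needs SM 8.9 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical guard for the test above:
# @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")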
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3944321Z 2025-05-07T20:32:35.3944436Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3944440Z 2025-05-07T20:32:35.3944544Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3944761Z self=, 2025-05-07T20:32:35.3944844Z T=128, 2025-05-07T20:32:35.3944915Z D=7168, 2025-05-07T20:32:35.3944996Z scale_ub=1200.0, 2025-05-07T20:32:35.3945077Z contiguous=True, 2025-05-07T20:32:35.3945157Z compiled=True, 2025-05-07T20:32:35.3945228Z ) 2025-05-07T20:32:35.3945447Z self = 2025-05-07T20:32:35.3945608Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3945616Z 2025-05-07T20:32:35.3945688Z @given( 2025-05-07T20:32:35.3945809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3945910Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3946024Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3946137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3946249Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3946327Z ) 2025-05-07T20:32:35.3946566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3946660Z def test_silu_mul_quant( 2025-05-07T20:32:35.3946734Z self, 2025-05-07T20:32:35.3946806Z T: int, 2025-05-07T20:32:35.3946882Z D: int, 2025-05-07T20:32:35.3946981Z scale_ub: Optional[float], 2025-05-07T20:32:35.3947068Z contiguous: bool, 2025-05-07T20:32:35.3947150Z compiled: bool, 2025-05-07T20:32:35.3947276Z ) -> None: 2025-05-07T20:32:35.3947367Z torch.manual_seed(2025) 2025-05-07T20:32:35.3947444Z 2025-05-07T20:32:35.3947650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3947724Z 2025-05-07T20:32:35.3947815Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3947939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3948026Z x = x_sign * x_clamp 2025-05-07T20:32:35.3948112Z x0 = x[:, :D] 2025-05-07T20:32:35.3948190Z x1 = x[:, D:] 2025-05-07T20:32:35.3948259Z 2025-05-07T20:32:35.3948385Z if contiguous: 2025-05-07T20:32:35.3948476Z x0 = x0.contiguous() 2025-05-07T20:32:35.3948562Z x1 = x1.contiguous() 2025-05-07T20:32:35.3948637Z 2025-05-07T20:32:35.3948724Z if scale_ub is not None: 2025-05-07T20:32:35.3948826Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3948960Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3949037Z ) 2025-05-07T20:32:35.3949110Z else: 2025-05-07T20:32:35.3949203Z scale_ub_tensor = None 2025-05-07T20:32:35.3949312Z 2025-05-07T20:32:35.3949451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3949537Z op = silu_mul_quant 2025-05-07T20:32:35.3949619Z if compiled: 2025-05-07T20:32:35.3949719Z op = torch.compile(op) 2025-05-07T20:32:35.3949823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3949894Z 2025-05-07T20:32:35.3949987Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3949991Z 2025-05-07T20:32:35.3950085Z moe/activation_test.py:117: 2025-05-07T20:32:35.3950217Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3950317Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3950413Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3950785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3950880Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3951371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3951471Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3951820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3952041Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3952378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3952473Z kernel = self.compile( 2025-05-07T20:32:35.3952852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3953023Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3953150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3953162Z 2025-05-07T20:32:35.3953365Z self = 2025-05-07T20:32:35.3954130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3954626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fb016851a80>} 2025-05-07T20:32:35.3955360Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3955594Z context = 2025-05-07T20:32:35.3955599Z 2025-05-07T20:32:35.3955798Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3956056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3956164Z module_map=module_map) 2025-05-07T20:32:35.3956324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3956421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3956498Z E ^ 2025-05-07T20:32:35.3956886Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3956892Z 2025-05-07T20:32:35.3957296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3957301Z 2025-05-07T20:32:35.3957399Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3957618Z self=, 2025-05-07T20:32:35.3957695Z T=128, 2025-05-07T20:32:35.3957810Z D=7168, 2025-05-07T20:32:35.3957895Z scale_ub=1200.0, 2025-05-07T20:32:35.3957977Z contiguous=True, 2025-05-07T20:32:35.3958061Z compiled=False, 2025-05-07T20:32:35.3958135Z ) 2025-05-07T20:32:35.3958346Z self = 2025-05-07T20:32:35.3958510Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.3958518Z 2025-05-07T20:32:35.3958594Z @given( 2025-05-07T20:32:35.3958712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3958810Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3958923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3959036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3959151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3959452Z ) 2025-05-07T20:32:35.3959767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3959863Z def test_silu_mul_quant( 2025-05-07T20:32:35.3959938Z self, 2025-05-07T20:32:35.3960013Z T: int, 2025-05-07T20:32:35.3960092Z D: int, 2025-05-07T20:32:35.3960188Z scale_ub: Optional[float], 2025-05-07T20:32:35.3960275Z contiguous: bool, 2025-05-07T20:32:35.3960362Z compiled: bool, 2025-05-07T20:32:35.3960440Z ) -> None: 2025-05-07T20:32:35.3960538Z torch.manual_seed(2025) 2025-05-07T20:32:35.3960615Z 2025-05-07T20:32:35.3960780Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3960855Z 2025-05-07T20:32:35.3960945Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3961067Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3962827Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
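Note the failure point drifting from the first allocation (activation_test.py:92) to the later clamp (:95) while the "allocated by PyTorch" figure creeps upward: memory pressure is accumulating across Hypothesis examples rather than coming from any single one. A blunt mitigation sketch, assuming it is called at the top of the test body (Hypothesis invokes the function once per generated example); the helper name is made up:

import gc
import torch

def _reclaim_cuda_memory() -> None:
    gc.collect()              # drop tensors that are no longer reachable
    torch.cuda.empty_cache()  # return freed cached blocks to the driver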
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3962835Z 2025-05-07T20:32:35.3962948Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.3962957Z 2025-05-07T20:32:35.3963061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3963276Z self=, 2025-05-07T20:32:35.3963352Z T=128, 2025-05-07T20:32:35.3963432Z D=5120, 2025-05-07T20:32:35.3963516Z scale_ub=1200.0, 2025-05-07T20:32:35.3963600Z contiguous=True, 2025-05-07T20:32:35.3963774Z compiled=True, 2025-05-07T20:32:35.3963848Z ) 2025-05-07T20:32:35.3964125Z self = 2025-05-07T20:32:35.3964293Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.3964297Z 2025-05-07T20:32:35.3964369Z @given( 2025-05-07T20:32:35.3964486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3964584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3964692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3964894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3965003Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3965078Z ) 2025-05-07T20:32:35.3965320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3965409Z def test_silu_mul_quant( 2025-05-07T20:32:35.3965488Z self, 2025-05-07T20:32:35.3965562Z T: int, 2025-05-07T20:32:35.3965634Z D: int, 2025-05-07T20:32:35.3965730Z scale_ub: Optional[float], 2025-05-07T20:32:35.3965819Z contiguous: bool, 2025-05-07T20:32:35.3965969Z compiled: bool, 2025-05-07T20:32:35.3966051Z ) -> None: 2025-05-07T20:32:35.3966143Z torch.manual_seed(2025) 2025-05-07T20:32:35.3966213Z 2025-05-07T20:32:35.3966378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3966449Z 2025-05-07T20:32:35.3966539Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3966663Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3968408Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
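For orientation, the op under test fuses SiLU gating with row-wise float8 quantization; the test body shown above compares FBGEMM's Triton kernel against an fp32 reference. A pure-PyTorch stand-in for the whole pipeline (assumptions: float8_e4m3fn as the target dtype and a max-over-fp8-max scale convention; the function name is illustrative, not FBGEMM's API):

import torch

def silu_mul_quant_sketch(x0, x1, scale_ub=None):
    # SiLU(x0) * x1 in fp32, then symmetric per-row quantization to fp8.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub: 1-element tensor
    scale = row_max / torch.finfo(torch.float8_e4m3fn).max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale  # dequantize as y_fp8.float() * scale[:, None]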
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3968416Z 2025-05-07T20:32:35.3968533Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:35.3968538Z 2025-05-07T20:32:35.3968636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3968854Z self=, 2025-05-07T20:32:35.3968928Z T=128, 2025-05-07T20:32:35.3969004Z D=7168, 2025-05-07T20:32:35.3969087Z scale_ub=None, 2025-05-07T20:32:35.3969168Z contiguous=True, 2025-05-07T20:32:35.3969250Z compiled=True, 2025-05-07T20:32:35.3969326Z ) 2025-05-07T20:32:35.3969537Z self = 2025-05-07T20:32:35.3969700Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.3969711Z 2025-05-07T20:32:35.3969784Z @given( 2025-05-07T20:32:35.3969898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3970001Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3970111Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3970224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3970337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3970410Z ) 2025-05-07T20:32:35.3970651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3970748Z def test_silu_mul_quant( 2025-05-07T20:32:35.3970820Z self, 2025-05-07T20:32:35.3970892Z T: int, 2025-05-07T20:32:35.3970967Z D: int, 2025-05-07T20:32:35.3971063Z scale_ub: Optional[float], 2025-05-07T20:32:35.3971153Z contiguous: bool, 2025-05-07T20:32:35.3971237Z compiled: bool, 2025-05-07T20:32:35.3971313Z ) -> None: 2025-05-07T20:32:35.3971453Z torch.manual_seed(2025) 2025-05-07T20:32:35.3971526Z 2025-05-07T20:32:35.3971725Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3973527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:35.3973573Z 2025-05-07T20:32:35.3973724Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:35.3973859Z =============================== warnings summary =============================== 2025-05-07T20:32:35.3974163Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.3974499Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.3974793Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:35.3975651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:35.3975887Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:35.3975892Z 2025-05-07T20:32:35.3976103Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:35.3976273Z ================= 1 failed, 1 deselected, 3 warnings in 15.08s ================= 2025-05-07T20:32:36.9684307Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:37.0302618Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:37.0302843Z 2025-05-07T20:32:39.0321060Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:41.1868365Z ============================= test session starts ============================== 2025-05-07T20:32:41.1869167Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:41.1869745Z cachedir: .pytest_cache 2025-05-07T20:32:41.1870313Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:41.1871044Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:41.1871449Z plugins: hypothesis-6.131.14 2025-05-07T20:32:42.8017523Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:42.9121666Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:42.9122073Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:42.9122288Z 2025-05-07T20:32:45.2680248Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.2680970Z self=, 2025-05-07T20:32:45.2681385Z T=1, 2025-05-07T20:32:45.2681573Z D=5120, 2025-05-07T20:32:45.2681778Z scale_ub=None, 2025-05-07T20:32:45.2681995Z contiguous=True, 2025-05-07T20:32:45.2682218Z compiled=True, 2025-05-07T20:32:45.2682435Z ) 2025-05-07T20:32:45.2682758Z self = 2025-05-07T20:32:45.2683633Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:45.2683992Z 2025-05-07T20:32:45.2684079Z @given( 2025-05-07T20:32:45.2684316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:45.2684633Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:45.2684936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:45.2685266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:45.2685595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:45.2685976Z ) 2025-05-07T20:32:45.2686326Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:45.2686786Z def test_silu_mul_quant( 2025-05-07T20:32:45.2687029Z self, 2025-05-07T20:32:45.2687234Z T: int, 2025-05-07T20:32:45.2687440Z D: int, 2025-05-07T20:32:45.2687656Z scale_ub: Optional[float], 2025-05-07T20:32:45.2687930Z contiguous: bool, 2025-05-07T20:32:45.2688173Z compiled: bool, 2025-05-07T20:32:45.2688397Z ) -> None: 2025-05-07T20:32:45.2688706Z torch.manual_seed(2025) 2025-05-07T20:32:45.2688958Z 2025-05-07T20:32:45.2689231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:45.2689575Z 2025-05-07T20:32:45.2689779Z x_sign = torch.sign(x) 2025-05-07T20:32:45.2690067Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:45.2690379Z x = x_sign * x_clamp 2025-05-07T20:32:45.2690628Z x0 = x[:, :D] 2025-05-07T20:32:45.2690841Z x1 = x[:, D:] 2025-05-07T20:32:45.2691059Z 2025-05-07T20:32:45.2691250Z if contiguous: 2025-05-07T20:32:45.2691477Z x0 = x0.contiguous() 2025-05-07T20:32:45.2691735Z x1 = x1.contiguous() 2025-05-07T20:32:45.2691979Z 2025-05-07T20:32:45.2692173Z if scale_ub is not None: 2025-05-07T20:32:45.2692445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:45.2692783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:45.2693249Z ) 2025-05-07T20:32:45.2693440Z else: 2025-05-07T20:32:45.2693656Z scale_ub_tensor = None 2025-05-07T20:32:45.2693909Z 2025-05-07T20:32:45.2694139Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.2694454Z op = silu_mul_quant 2025-05-07T20:32:45.2694704Z if compiled: 2025-05-07T20:32:45.2694949Z op = torch.compile(op) 2025-05-07T20:32:45.2695254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:45.2695530Z 2025-05-07T20:32:45.2695715Z y_fp8, y_scale = fn() 2025-05-07T20:32:45.2696006Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:45.2696296Z 2025-05-07T20:32:45.2696532Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:45.2696870Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:45.2697166Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:45.2697484Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:45.2697848Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.2698163Z 2025-05-07T20:32:45.2698368Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:45.2698561Z 2025-05-07T20:32:45.2698663Z moe/activation_test.py:126: 2025-05-07T20:32:45.2698964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.2699301Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:45.2699631Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:45.2700417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:45.2701167Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:45.2701714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.2702488Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.2703176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:45.2703894Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:45.2704618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:45.2705297Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:45.2705898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:45.2706413Z fn() 2025-05-07T20:32:45.2706923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:45.2707499Z self.fn.run( 2025-05-07T20:32:45.2707971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.2708551Z kernel = self.compile( 2025-05-07T20:32:45.2709086Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.2709735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.2710186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.2710423Z 2025-05-07T20:32:45.2710636Z self = 2025-05-07T20:32:45.2711709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.2713103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc89114dc60>} 2025-05-07T20:32:45.2714432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.2715448Z context = 2025-05-07T20:32:45.2715738Z 2025-05-07T20:32:45.2715905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.2716429Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.2716898Z module_map=module_map) 2025-05-07T20:32:45.2717269Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.2717622Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:45.2717897Z E ^ 2025-05-07T20:32:45.2718366Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:45.2718813Z 2025-05-07T20:32:45.2719235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:45.2719743Z 2025-05-07T20:32:45.2719848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:45.2720293Z self=, 2025-05-07T20:32:45.2720722Z T=2048, 2025-05-07T20:32:45.2720918Z D=5120, 2025-05-07T20:32:45.2721117Z scale_ub=1200.0, 2025-05-07T20:32:45.2721344Z contiguous=True, 2025-05-07T20:32:45.2721564Z compiled=False, 2025-05-07T20:32:45.2721777Z ) 2025-05-07T20:32:46.0045162Z self = 2025-05-07T20:32:46.0045932Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.0046625Z 2025-05-07T20:32:46.0046734Z @given( 2025-05-07T20:32:46.0047052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.0047627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.0047962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.0048293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.0048625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.0048910Z ) 2025-05-07T20:32:46.0049270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.0049820Z def test_silu_mul_quant( 2025-05-07T20:32:46.0050063Z self, 2025-05-07T20:32:46.0050254Z T: int, 2025-05-07T20:32:46.0050456Z D: int, 2025-05-07T20:32:46.0050678Z scale_ub: Optional[float], 2025-05-07T20:32:46.0050950Z contiguous: bool, 2025-05-07T20:32:46.0051197Z compiled: bool, 2025-05-07T20:32:46.0051437Z ) -> None: 2025-05-07T20:32:46.0051654Z torch.manual_seed(2025) 2025-05-07T20:32:46.0051902Z 2025-05-07T20:32:46.0052288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.0052634Z 2025-05-07T20:32:46.0052837Z x_sign = torch.sign(x) 2025-05-07T20:32:46.0053257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.0053566Z x = x_sign * x_clamp 2025-05-07T20:32:46.0053819Z x0 = x[:, :D] 
2025-05-07T20:32:46.0054040Z x1 = x[:, D:] 2025-05-07T20:32:46.0054241Z 2025-05-07T20:32:46.0054438Z if contiguous: 2025-05-07T20:32:46.0054674Z x0 = x0.contiguous() 2025-05-07T20:32:46.0054928Z x1 = x1.contiguous() 2025-05-07T20:32:46.0055172Z 2025-05-07T20:32:46.0055379Z if scale_ub is not None: 2025-05-07T20:32:46.0055656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.0055992Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.0056306Z ) 2025-05-07T20:32:46.0056508Z else: 2025-05-07T20:32:46.0056716Z scale_ub_tensor = None 2025-05-07T20:32:46.0056972Z 2025-05-07T20:32:46.0057209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.0057522Z op = silu_mul_quant 2025-05-07T20:32:46.0057778Z if compiled: 2025-05-07T20:32:46.0058028Z op = torch.compile(op) 2025-05-07T20:32:46.0058318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.0058596Z 2025-05-07T20:32:46.0058792Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.0058960Z 2025-05-07T20:32:46.0059060Z moe/activation_test.py:117: 2025-05-07T20:32:46.0059791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0060135Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.0060419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.0061110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.0061801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.0062346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.0063022Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.0063684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.0064215Z kernel = self.compile( 2025-05-07T20:32:46.0064764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.0065405Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.0065807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0066033Z 2025-05-07T20:32:46.0066245Z self = 2025-05-07T20:32:46.0067474Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.0068839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db0220>} 2025-05-07T20:32:46.0070169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.0071251Z context = 2025-05-07T20:32:46.0071535Z 2025-05-07T20:32:46.0071708Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.0072222Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.0072696Z module_map=module_map) 2025-05-07T20:32:46.0073129Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.0073486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.0073745Z E ^ 2025-05-07T20:32:46.0074211Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.0074655Z 2025-05-07T20:32:46.0075073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.0075586Z 2025-05-07T20:32:46.0075697Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.0076104Z self=, 2025-05-07T20:32:46.0076509Z T=2048, 2025-05-07T20:32:46.0076705Z D=5120, 2025-05-07T20:32:46.0076899Z scale_ub=1200.0, 2025-05-07T20:32:46.0077127Z contiguous=True, 2025-05-07T20:32:46.0077357Z compiled=True, 2025-05-07T20:32:46.0077562Z ) 2025-05-07T20:32:46.0077889Z self = 2025-05-07T20:32:46.0078386Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.0078651Z 2025-05-07T20:32:46.0078732Z @given( 2025-05-07T20:32:46.0078971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.0079285Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.0079605Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.0079979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.0080311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.0080603Z ) 2025-05-07T20:32:46.0080949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.0081388Z def test_silu_mul_quant( 2025-05-07T20:32:46.0081630Z self, 2025-05-07T20:32:46.0081824Z T: int, 2025-05-07T20:32:46.0082027Z D: int, 2025-05-07T20:32:46.0082252Z scale_ub: Optional[float], 2025-05-07T20:32:46.0082521Z contiguous: bool, 2025-05-07T20:32:46.0082765Z compiled: bool, 2025-05-07T20:32:46.0083007Z ) -> None: 2025-05-07T20:32:46.0090439Z torch.manual_seed(2025) 2025-05-07T20:32:46.0090714Z 2025-05-07T20:32:46.0091103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.0091571Z 2025-05-07T20:32:46.0091852Z x_sign = torch.sign(x) 2025-05-07T20:32:46.0092189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.0092498Z x = x_sign * x_clamp 2025-05-07T20:32:46.0092743Z x0 = x[:, :D] 2025-05-07T20:32:46.0092964Z x1 = x[:, D:] 2025-05-07T20:32:46.0093228Z 2025-05-07T20:32:46.0093446Z if contiguous: 2025-05-07T20:32:46.0093682Z x0 = x0.contiguous() 2025-05-07T20:32:46.0094036Z x1 = x1.contiguous() 2025-05-07T20:32:46.0094267Z 2025-05-07T20:32:46.0094465Z if scale_ub is not None: 2025-05-07T20:32:46.0094795Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.0095137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.0095445Z ) 2025-05-07T20:32:46.0095653Z else: 2025-05-07T20:32:46.0095871Z scale_ub_tensor = None 2025-05-07T20:32:46.0096118Z 2025-05-07T20:32:46.0096357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.0096724Z op = silu_mul_quant 2025-05-07T20:32:46.0096976Z if compiled: 2025-05-07T20:32:46.0097235Z op = torch.compile(op) 2025-05-07T20:32:46.0097536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.0097811Z 2025-05-07T20:32:46.0098012Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.0098312Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.0098603Z 2025-05-07T20:32:46.0098852Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.0099241Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.0099542Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.0099860Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.0100229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.0100546Z 2025-05-07T20:32:46.0100754Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.0100953Z 2025-05-07T20:32:46.0101058Z moe/activation_test.py:126: 2025-05-07T20:32:46.0101363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0101692Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.0102021Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.0102809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.0103558Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.0104102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.0104780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.0105463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.0106180Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.0106895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.0107533Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.0108131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.0108636Z fn() 2025-05-07T20:32:46.0109144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.0109722Z self.fn.run( 2025-05-07T20:32:46.0110193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.0110709Z kernel = self.compile( 2025-05-07T20:32:46.0111247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.0111899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.0112287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.0112521Z 2025-05-07T20:32:46.0112730Z self = 2025-05-07T20:32:46.0113848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.0115246Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db16c0>} 2025-05-07T20:32:46.0116568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.0117615Z context = 2025-05-07T20:32:46.0117907Z 2025-05-07T20:32:46.0118071Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.0118589Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.0119054Z module_map=module_map) 2025-05-07T20:32:46.0119419Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.0119791Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.0120142Z E ^ 2025-05-07T20:32:46.0120601Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.0121055Z 2025-05-07T20:32:46.0121467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.0121981Z 2025-05-07T20:32:46.0122090Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.0122514Z self=, 2025-05-07T20:32:46.0122910Z T=16384, 2025-05-07T20:32:46.0123112Z D=7168, 2025-05-07T20:32:46.0123315Z scale_ub=1200.0, 2025-05-07T20:32:46.0123540Z contiguous=False, 2025-05-07T20:32:46.0123777Z compiled=False, 2025-05-07T20:32:46.0123990Z ) 2025-05-07T20:32:46.7388674Z self = 2025-05-07T20:32:46.7389428Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:46.7389719Z 2025-05-07T20:32:46.7389803Z @given( 2025-05-07T20:32:46.7390054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7390372Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7390693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7391031Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7391366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7391658Z ) 2025-05-07T20:32:46.7392019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7392463Z def test_silu_mul_quant( 2025-05-07T20:32:46.7392714Z self, 2025-05-07T20:32:46.7392926Z T: int, 2025-05-07T20:32:46.7393130Z D: int, 2025-05-07T20:32:46.7393358Z scale_ub: Optional[float], 2025-05-07T20:32:46.7393644Z contiguous: bool, 2025-05-07T20:32:46.7393892Z compiled: bool, 2025-05-07T20:32:46.7394127Z ) -> None: 2025-05-07T20:32:46.7394350Z torch.manual_seed(2025) 2025-05-07T20:32:46.7394594Z 2025-05-07T20:32:46.7394864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7395214Z 2025-05-07T20:32:46.7395412Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7395766Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7396183Z x = x_sign * x_clamp 2025-05-07T20:32:46.7396446Z x0 = x[:, :D] 2025-05-07T20:32:46.7396667Z x1 = x[:, D:] 2025-05-07T20:32:46.7396893Z 2025-05-07T20:32:46.7397086Z if contiguous: 2025-05-07T20:32:46.7397317Z x0 = x0.contiguous() 2025-05-07T20:32:46.7397580Z x1 = x1.contiguous() 2025-05-07T20:32:46.7397838Z 2025-05-07T20:32:46.7398031Z if scale_ub is not None: 2025-05-07T20:32:46.7398609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7399045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7399371Z ) 2025-05-07T20:32:46.7399566Z else: 2025-05-07T20:32:46.7399789Z scale_ub_tensor = None 2025-05-07T20:32:46.7400053Z 2025-05-07T20:32:46.7400287Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7400611Z op = silu_mul_quant 2025-05-07T20:32:46.7400866Z if compiled: 2025-05-07T20:32:46.7401114Z op = torch.compile(op) 2025-05-07T20:32:46.7401510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7401793Z 2025-05-07T20:32:46.7401983Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.7402159Z 2025-05-07T20:32:46.7402262Z moe/activation_test.py:117: 2025-05-07T20:32:46.7402565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7402904Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.7403182Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7403958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:46.7404650Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.7405181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7405867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7406534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7407066Z kernel = self.compile( 2025-05-07T20:32:46.7407600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7408250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7408646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7408872Z 2025-05-07T20:32:46.7409087Z self = 2025-05-07T20:32:46.7410163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7411543Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd28540>} 2025-05-07T20:32:46.7412879Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7413984Z context = 2025-05-07T20:32:46.7414273Z 2025-05-07T20:32:46.7414441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7414959Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7415427Z module_map=module_map) 2025-05-07T20:32:46.7415807Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7416164Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.7416419Z E ^ 2025-05-07T20:32:46.7416882Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7417333Z 2025-05-07T20:32:46.7417747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7418253Z 2025-05-07T20:32:46.7418368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7418829Z self=, 2025-05-07T20:32:46.7419231Z T=1, 2025-05-07T20:32:46.7419419Z D=7168, 2025-05-07T20:32:46.7419656Z scale_ub=None, 2025-05-07T20:32:46.7419881Z contiguous=True, 2025-05-07T20:32:46.7420143Z compiled=True, 2025-05-07T20:32:46.7420360Z ) 2025-05-07T20:32:46.7420679Z self = 2025-05-07T20:32:46.7421159Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7421410Z 2025-05-07T20:32:46.7421491Z @given( 2025-05-07T20:32:46.7421761Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7422073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7422379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7422697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7423026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7423315Z ) 2025-05-07T20:32:46.7423658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7424099Z def test_silu_mul_quant( 2025-05-07T20:32:46.7424390Z self, 2025-05-07T20:32:46.7424583Z T: int, 2025-05-07T20:32:46.7424786Z D: int, 2025-05-07T20:32:46.7425009Z scale_ub: Optional[float], 2025-05-07T20:32:46.7425278Z contiguous: bool, 2025-05-07T20:32:46.7425518Z compiled: bool, 2025-05-07T20:32:46.7425744Z ) -> None: 2025-05-07T20:32:46.7425972Z torch.manual_seed(2025) 2025-05-07T20:32:46.7426221Z 2025-05-07T20:32:46.7426496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7426831Z 2025-05-07T20:32:46.7427033Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7427333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7427650Z x = x_sign * x_clamp 2025-05-07T20:32:46.7427895Z x0 = x[:, :D] 2025-05-07T20:32:46.7428126Z x1 = x[:, D:] 2025-05-07T20:32:46.7428337Z 2025-05-07T20:32:46.7428525Z if contiguous: 2025-05-07T20:32:46.7428770Z x0 = x0.contiguous() 2025-05-07T20:32:46.7429042Z x1 = x1.contiguous() 2025-05-07T20:32:46.7429282Z 2025-05-07T20:32:46.7429488Z if scale_ub is not None: 2025-05-07T20:32:46.7429771Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7430106Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7430423Z ) 2025-05-07T20:32:46.7430622Z else: 2025-05-07T20:32:46.7430838Z scale_ub_tensor = None 2025-05-07T20:32:46.7431093Z 2025-05-07T20:32:46.7431330Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7431643Z op = silu_mul_quant 2025-05-07T20:32:46.7431900Z if compiled: 2025-05-07T20:32:46.7432153Z op = torch.compile(op) 2025-05-07T20:32:46.7432461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7432740Z 2025-05-07T20:32:46.7432935Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7433229Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7433519Z 2025-05-07T20:32:46.7433763Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7434103Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7434395Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7434713Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7435080Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7435391Z 2025-05-07T20:32:46.7435598Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:46.7435801Z 2025-05-07T20:32:46.7435901Z moe/activation_test.py:126: 2025-05-07T20:32:46.7436202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7436538Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7436917Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7437751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7438498Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7439049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7439726Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7440451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7441159Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7441886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7442523Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7443159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7443671Z fn() 2025-05-07T20:32:46.7444180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7444759Z self.fn.run( 2025-05-07T20:32:46.7445225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7445753Z kernel = self.compile( 2025-05-07T20:32:46.7446294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7446941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7447328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7447560Z 2025-05-07T20:32:46.7447767Z self = 2025-05-07T20:32:46.7448846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7450265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd28e00>} 2025-05-07T20:32:46.7451588Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7452607Z context = 2025-05-07T20:32:46.7452900Z 2025-05-07T20:32:46.7453149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7453674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7454145Z module_map=module_map) 2025-05-07T20:32:46.7454521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7454881Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7455148Z E ^ 2025-05-07T20:32:46.7455613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7456064Z 2025-05-07T20:32:46.7456482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7456987Z 2025-05-07T20:32:46.7457100Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7457515Z self=, 2025-05-07T20:32:46.7457921Z T=4096, 2025-05-07T20:32:46.7458119Z D=5120, 2025-05-07T20:32:46.7458398Z scale_ub=None, 2025-05-07T20:32:46.7458619Z contiguous=False, 2025-05-07T20:32:46.7458851Z compiled=False, 2025-05-07T20:32:46.7459108Z ) 2025-05-07T20:32:47.5320028Z self = 2025-05-07T20:32:47.5320992Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.5321347Z 2025-05-07T20:32:47.5321437Z @given( 2025-05-07T20:32:47.5321689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5322059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5322701Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5323036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5323363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5323655Z ) 2025-05-07T20:32:47.5324015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5324466Z def test_silu_mul_quant( 2025-05-07T20:32:47.5324724Z self, 2025-05-07T20:32:47.5324925Z T: int, 2025-05-07T20:32:47.5325119Z D: int, 2025-05-07T20:32:47.5325438Z scale_ub: Optional[float], 2025-05-07T20:32:47.5325714Z contiguous: bool, 2025-05-07T20:32:47.5325963Z compiled: bool, 2025-05-07T20:32:47.5326191Z ) -> None: 2025-05-07T20:32:47.5326412Z torch.manual_seed(2025) 2025-05-07T20:32:47.5326665Z 2025-05-07T20:32:47.5326946Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5327308Z 2025-05-07T20:32:47.5327508Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5327797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5328110Z x = x_sign * x_clamp 2025-05-07T20:32:47.5328357Z x0 = x[:, :D] 2025-05-07T20:32:47.5328571Z x1 = x[:, D:] 2025-05-07T20:32:47.5328787Z 2025-05-07T20:32:47.5328978Z if contiguous: 2025-05-07T20:32:47.5329209Z x0 = x0.contiguous() 2025-05-07T20:32:47.5329473Z x1 = x1.contiguous() 2025-05-07T20:32:47.5329725Z 2025-05-07T20:32:47.5329918Z if scale_ub is not None: 2025-05-07T20:32:47.5330191Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5330533Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.5330842Z ) 2025-05-07T20:32:47.5331032Z else: 2025-05-07T20:32:47.5331251Z scale_ub_tensor = None 2025-05-07T20:32:47.5331511Z 2025-05-07T20:32:47.5331743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5332066Z op = silu_mul_quant 2025-05-07T20:32:47.5332318Z if compiled: 2025-05-07T20:32:47.5332563Z op = torch.compile(op) 2025-05-07T20:32:47.5332863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5333260Z 2025-05-07T20:32:47.5333452Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.5333628Z 2025-05-07T20:32:47.5333729Z moe/activation_test.py:117: 2025-05-07T20:32:47.5334027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5334359Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.5334647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5335341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.5336030Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.5336562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5337248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5337909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5338440Z kernel = self.compile( 2025-05-07T20:32:47.5338978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5339823Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5340230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5340459Z 2025-05-07T20:32:47.5340667Z self = 2025-05-07T20:32:47.5341743Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5343159Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890f7f240>} 2025-05-07T20:32:47.5344484Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5345554Z context = 2025-05-07T20:32:47.5345841Z 2025-05-07T20:32:47.5346011Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5346534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5347004Z module_map=module_map) 2025-05-07T20:32:47.5347366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5347722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.5347986Z E ^ 2025-05-07T20:32:47.5348447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5348890Z 2025-05-07T20:32:47.5349304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5349813Z 2025-05-07T20:32:47.5349917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5350381Z self=, 2025-05-07T20:32:47.5350789Z T=4096, 2025-05-07T20:32:47.5350976Z D=7168, 2025-05-07T20:32:47.5351171Z scale_ub=None, 2025-05-07T20:32:47.5351397Z contiguous=False, 2025-05-07T20:32:47.5351622Z compiled=False, 2025-05-07T20:32:47.5351830Z ) 2025-05-07T20:32:47.5352156Z self = 2025-05-07T20:32:47.5352647Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.5352929Z 2025-05-07T20:32:47.5353010Z @given( 2025-05-07T20:32:47.5353245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5353552Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5353863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5354195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5354526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5354813Z ) 2025-05-07T20:32:47.5355168Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5355612Z def test_silu_mul_quant( 2025-05-07T20:32:47.5355850Z self, 2025-05-07T20:32:47.5356050Z T: int, 2025-05-07T20:32:47.5356252Z D: int, 2025-05-07T20:32:47.5356470Z scale_ub: Optional[float], 2025-05-07T20:32:47.5356751Z contiguous: bool, 2025-05-07T20:32:47.5356994Z compiled: bool, 2025-05-07T20:32:47.5357216Z ) -> None: 2025-05-07T20:32:47.5357436Z torch.manual_seed(2025) 2025-05-07T20:32:47.5357685Z 2025-05-07T20:32:47.5357957Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5358303Z 2025-05-07T20:32:47.5358505Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5358867Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5359668Z x = x_sign * x_clamp 2025-05-07T20:32:47.5360001Z x0 = x[:, :D] 2025-05-07T20:32:47.5360234Z x1 = x[:, D:] 2025-05-07T20:32:47.5360454Z 2025-05-07T20:32:47.5367707Z if contiguous: 2025-05-07T20:32:47.5367955Z x0 = x0.contiguous() 2025-05-07T20:32:47.5368212Z x1 = x1.contiguous() 2025-05-07T20:32:47.5368462Z 2025-05-07T20:32:47.5368663Z if scale_ub is not None: 2025-05-07T20:32:47.5368943Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5369406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.5369722Z ) 2025-05-07T20:32:47.5369926Z else: 2025-05-07T20:32:47.5370139Z scale_ub_tensor = None 2025-05-07T20:32:47.5370402Z 2025-05-07T20:32:47.5370645Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5370962Z op = silu_mul_quant 2025-05-07T20:32:47.5371225Z if compiled: 2025-05-07T20:32:47.5371480Z op = torch.compile(op) 2025-05-07T20:32:47.5371843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5372128Z 2025-05-07T20:32:47.5372333Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.5372498Z 2025-05-07T20:32:47.5372602Z moe/activation_test.py:117: 2025-05-07T20:32:47.5372908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5373320Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.5373605Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5374292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.5374989Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.5375565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5376540Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5377484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5378238Z kernel = self.compile( 2025-05-07T20:32:47.5379098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5380098Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5380703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5381006Z 2025-05-07T20:32:47.5381221Z self = 2025-05-07T20:32:47.5382302Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5383676Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b181440>} 2025-05-07T20:32:47.5385015Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5386050Z context = 2025-05-07T20:32:47.5386344Z 2025-05-07T20:32:47.5386527Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5387063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5387538Z module_map=module_map) 2025-05-07T20:32:47.5387924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5388423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.5388697Z E ^ 2025-05-07T20:32:47.5389222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5389677Z 2025-05-07T20:32:47.5390102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5390613Z 2025-05-07T20:32:47.5390734Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5391153Z self=, 2025-05-07T20:32:47.5391612Z T=128, 2025-05-07T20:32:47.5391819Z D=7168, 2025-05-07T20:32:47.5392023Z scale_ub=None, 2025-05-07T20:32:47.5392258Z contiguous=False, 2025-05-07T20:32:47.5392509Z compiled=True, 2025-05-07T20:32:47.5392725Z ) 2025-05-07T20:32:47.5949362Z self = 2025-05-07T20:32:47.5950067Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.5950423Z 2025-05-07T20:32:47.5950515Z @given( 2025-05-07T20:32:47.5950965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5951294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5951621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5951956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5952302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5952601Z ) 2025-05-07T20:32:47.5952977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5953429Z def test_silu_mul_quant( 2025-05-07T20:32:47.5953688Z self, 2025-05-07T20:32:47.5953897Z T: int, 2025-05-07T20:32:47.5954102Z D: int, 2025-05-07T20:32:47.5954331Z scale_ub: Optional[float], 2025-05-07T20:32:47.5954612Z contiguous: bool, 2025-05-07T20:32:47.5954852Z compiled: bool, 2025-05-07T20:32:47.5955097Z ) -> None: 2025-05-07T20:32:47.5955327Z torch.manual_seed(2025) 2025-05-07T20:32:47.5955576Z 2025-05-07T20:32:47.5955864Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5956219Z 2025-05-07T20:32:47.5956415Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5956716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5957030Z x = x_sign * x_clamp 2025-05-07T20:32:47.5957273Z x0 = x[:, :D] 2025-05-07T20:32:47.5957498Z x1 = x[:, D:] 2025-05-07T20:32:47.5957732Z 2025-05-07T20:32:47.5957922Z if contiguous: 2025-05-07T20:32:47.5958172Z x0 = x0.contiguous() 2025-05-07T20:32:47.5958441Z x1 = x1.contiguous() 2025-05-07T20:32:47.5958688Z 2025-05-07T20:32:47.5958887Z if scale_ub is not None: 2025-05-07T20:32:47.5959165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5959825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.5960137Z ) 2025-05-07T20:32:47.5960344Z else: 2025-05-07T20:32:47.5960585Z scale_ub_tensor = None 2025-05-07T20:32:47.5960848Z 2025-05-07T20:32:47.5961081Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5961400Z op = silu_mul_quant 2025-05-07T20:32:47.5961656Z if compiled: 2025-05-07T20:32:47.5961909Z op = torch.compile(op) 2025-05-07T20:32:47.5962207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5962491Z 2025-05-07T20:32:47.5962688Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.5962977Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.5963277Z 2025-05-07T20:32:47.5963518Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5963859Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.5964159Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.5964563Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.5964924Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5965331Z 2025-05-07T20:32:47.5965552Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.5965750Z 2025-05-07T20:32:47.5965853Z moe/activation_test.py:126: 2025-05-07T20:32:47.5966161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5966500Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.5966824Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5967684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.5968439Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.5968988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5969665Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5970411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.5971134Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.5971863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.5972496Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.5973160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.5973680Z fn() 2025-05-07T20:32:47.5974184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.5974772Z self.fn.run( 2025-05-07T20:32:47.5975244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5975779Z kernel = self.compile( 2025-05-07T20:32:47.5976319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5976968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5977374Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5977603Z 2025-05-07T20:32:47.5977822Z self = 2025-05-07T20:32:47.5978914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5980334Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af64540>} 2025-05-07T20:32:47.5981727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5982745Z context = 2025-05-07T20:32:47.5983029Z 2025-05-07T20:32:47.5983194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5983714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5984185Z module_map=module_map) 2025-05-07T20:32:47.5984554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5984907Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.5985181Z E ^ 2025-05-07T20:32:47.5985649Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5986155Z 2025-05-07T20:32:47.5986617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5987140Z 2025-05-07T20:32:47.5987248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5987672Z self=, 2025-05-07T20:32:47.5988078Z T=128, 2025-05-07T20:32:47.5988263Z D=7168, 2025-05-07T20:32:47.5988469Z scale_ub=None, 2025-05-07T20:32:47.5988738Z contiguous=False, 2025-05-07T20:32:47.5988968Z compiled=False, 2025-05-07T20:32:47.5989185Z ) 2025-05-07T20:32:47.7942635Z self = 2025-05-07T20:32:47.7943367Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.7943733Z 2025-05-07T20:32:47.7943844Z @given( 2025-05-07T20:32:47.7944134Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7944458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7945054Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7945390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7945727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7946016Z ) 2025-05-07T20:32:47.7946365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7946817Z def test_silu_mul_quant( 2025-05-07T20:32:47.7947071Z self, 2025-05-07T20:32:47.7947275Z T: int, 2025-05-07T20:32:47.7947484Z D: int, 2025-05-07T20:32:47.7947712Z scale_ub: Optional[float], 2025-05-07T20:32:47.7947984Z contiguous: bool, 2025-05-07T20:32:47.7948237Z compiled: bool, 2025-05-07T20:32:47.7948475Z ) -> None: 2025-05-07T20:32:47.7948691Z torch.manual_seed(2025) 2025-05-07T20:32:47.7948941Z 2025-05-07T20:32:47.7949219Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7949564Z 2025-05-07T20:32:47.7949766Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7950068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7950385Z x = x_sign * x_clamp 2025-05-07T20:32:47.7950633Z x0 = x[:, :D] 2025-05-07T20:32:47.7950856Z x1 = x[:, D:] 2025-05-07T20:32:47.7951072Z 2025-05-07T20:32:47.7951260Z if contiguous: 2025-05-07T20:32:47.7951495Z x0 = x0.contiguous() 2025-05-07T20:32:47.7951762Z x1 = x1.contiguous() 2025-05-07T20:32:47.7952002Z 2025-05-07T20:32:47.7952219Z if scale_ub is not None: 2025-05-07T20:32:47.7952503Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7952840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7953160Z ) 2025-05-07T20:32:47.7953365Z else: 2025-05-07T20:32:47.7953586Z scale_ub_tensor = None 2025-05-07T20:32:47.7953857Z 2025-05-07T20:32:47.7954098Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7954415Z op = silu_mul_quant 2025-05-07T20:32:47.7954676Z if compiled: 2025-05-07T20:32:47.7954933Z op = torch.compile(op) 2025-05-07T20:32:47.7955232Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7955524Z 2025-05-07T20:32:47.7955731Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.7955900Z 2025-05-07T20:32:47.7956012Z moe/activation_test.py:117: 2025-05-07T20:32:47.7956325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7956664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.7956953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7957642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.7958455Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.7959070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7960085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7960739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7961272Z kernel = self.compile( 2025-05-07T20:32:47.7961813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7962550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7962950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7963181Z 2025-05-07T20:32:47.7963389Z self = 2025-05-07T20:32:47.7964520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7965891Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af66700>} 2025-05-07T20:32:47.7967221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7968241Z context = 2025-05-07T20:32:47.7968557Z 2025-05-07T20:32:47.7968725Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.7969245Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.7969716Z module_map=module_map) 2025-05-07T20:32:47.7970081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.7970439Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.7970704Z E ^ 2025-05-07T20:32:47.7971167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.7971615Z 2025-05-07T20:32:47.7972027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.7972548Z 2025-05-07T20:32:47.7972651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.7973133Z self=, 2025-05-07T20:32:47.7973529Z T=4096, 2025-05-07T20:32:47.7973724Z D=5120, 2025-05-07T20:32:47.7973925Z scale_ub=1200.0, 2025-05-07T20:32:47.7974147Z contiguous=True, 2025-05-07T20:32:47.7974373Z compiled=False, 2025-05-07T20:32:47.7974591Z ) 2025-05-07T20:32:47.7974909Z self = 2025-05-07T20:32:47.7975414Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.7975688Z 2025-05-07T20:32:47.7975778Z @given( 2025-05-07T20:32:47.7976015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7976325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7976641Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7976974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7977313Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7977608Z ) 2025-05-07T20:32:47.7977961Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7978399Z def test_silu_mul_quant( 2025-05-07T20:32:47.7978654Z self, 2025-05-07T20:32:47.7978863Z T: int, 2025-05-07T20:32:47.7979132Z D: int, 2025-05-07T20:32:47.7979359Z scale_ub: Optional[float], 2025-05-07T20:32:47.7979640Z contiguous: bool, 2025-05-07T20:32:47.7979950Z compiled: bool, 2025-05-07T20:32:47.7980180Z ) -> None: 2025-05-07T20:32:47.7980405Z torch.manual_seed(2025) 2025-05-07T20:32:47.7980646Z 2025-05-07T20:32:47.7980925Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7981274Z 2025-05-07T20:32:47.7981472Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7981759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7982118Z x = x_sign * x_clamp 2025-05-07T20:32:47.7982365Z x0 = x[:, :D] 2025-05-07T20:32:47.7982578Z x1 = x[:, D:] 2025-05-07T20:32:47.7982794Z 2025-05-07T20:32:47.7982981Z if contiguous: 2025-05-07T20:32:47.7983211Z x0 = x0.contiguous() 2025-05-07T20:32:47.7983476Z x1 = x1.contiguous() 2025-05-07T20:32:47.7983723Z 2025-05-07T20:32:47.7983916Z if scale_ub is not None: 2025-05-07T20:32:47.7984193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7984580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7984888Z ) 2025-05-07T20:32:47.7985089Z else: 2025-05-07T20:32:47.7985304Z scale_ub_tensor = None 2025-05-07T20:32:47.7985556Z 2025-05-07T20:32:47.7985795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7986114Z op = silu_mul_quant 2025-05-07T20:32:47.7986372Z if compiled: 2025-05-07T20:32:47.7986628Z op = torch.compile(op) 2025-05-07T20:32:47.7986929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7987212Z 2025-05-07T20:32:47.7987406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.7987578Z 2025-05-07T20:32:47.7987678Z moe/activation_test.py:117: 2025-05-07T20:32:47.7987982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7988312Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.7988603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7989294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.7989978Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.7990510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7991193Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7991857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7992379Z kernel = self.compile( 2025-05-07T20:32:47.7992924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7993579Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7993981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7994213Z 2025-05-07T20:32:47.7994423Z self = 2025-05-07T20:32:47.7995491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7996848Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af676a0>} 2025-05-07T20:32:47.7998178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7999245Z context = 2025-05-07T20:32:47.7999533Z 2025-05-07T20:32:47.7999738Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.8000264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.8000786Z module_map=module_map) 2025-05-07T20:32:47.8001148Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.8001512Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.8001772Z E ^ 2025-05-07T20:32:47.8002278Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.8002723Z 2025-05-07T20:32:47.8003135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.8003645Z 2025-05-07T20:32:47.8003751Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.8004175Z self=, 2025-05-07T20:32:47.8004567Z T=1, 2025-05-07T20:32:47.8004797Z D=5120, 2025-05-07T20:32:47.8004998Z scale_ub=None, 2025-05-07T20:32:47.8005210Z contiguous=True, 2025-05-07T20:32:47.8005438Z compiled=True, 2025-05-07T20:32:47.8005648Z ) 2025-05-07T20:32:48.1738602Z self = 2025-05-07T20:32:48.1739318Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.1739701Z 2025-05-07T20:32:48.1739786Z @given( 2025-05-07T20:32:48.1740038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.1740360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.1740677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.1741017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.1741359Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.1741662Z ) 2025-05-07T20:32:48.1742017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.1742480Z def test_silu_mul_quant( 2025-05-07T20:32:48.1742727Z self, 2025-05-07T20:32:48.1742923Z T: int, 2025-05-07T20:32:48.1743128Z D: int, 2025-05-07T20:32:48.1743353Z scale_ub: Optional[float], 2025-05-07T20:32:48.1743629Z contiguous: bool, 2025-05-07T20:32:48.1743888Z compiled: bool, 2025-05-07T20:32:48.1744125Z ) -> None: 2025-05-07T20:32:48.1744350Z torch.manual_seed(2025) 2025-05-07T20:32:48.1744607Z 2025-05-07T20:32:48.1744882Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.1745225Z 2025-05-07T20:32:48.1745435Z x_sign = torch.sign(x) 2025-05-07T20:32:48.1745733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.1746050Z x = x_sign * x_clamp 2025-05-07T20:32:48.1746321Z x0 = x[:, :D] 2025-05-07T20:32:48.1746540Z x1 = x[:, D:] 2025-05-07T20:32:48.1746759Z 2025-05-07T20:32:48.1746957Z if contiguous: 2025-05-07T20:32:48.1747190Z x0 = x0.contiguous() 2025-05-07T20:32:48.1747452Z x1 = x1.contiguous() 2025-05-07T20:32:48.1747700Z 2025-05-07T20:32:48.1747894Z if scale_ub is not None: 2025-05-07T20:32:48.1748177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.1748524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.1748831Z ) 2025-05-07T20:32:48.1749034Z else: 2025-05-07T20:32:48.1749255Z scale_ub_tensor = None 2025-05-07T20:32:48.1749508Z 2025-05-07T20:32:48.1749752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1750077Z op = silu_mul_quant 2025-05-07T20:32:48.1750333Z if compiled: 2025-05-07T20:32:48.1750594Z op = torch.compile(op) 2025-05-07T20:32:48.1751184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1751469Z 2025-05-07T20:32:48.1751665Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.1752057Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.1752361Z 2025-05-07T20:32:48.1752596Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1752942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.1753242Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.1753554Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.1754013Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.1754349Z 2025-05-07T20:32:48.1761773Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.1761999Z 2025-05-07T20:32:48.1762109Z moe/activation_test.py:126: 2025-05-07T20:32:48.1762423Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1762771Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.1763106Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.1764056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.1764814Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.1765350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.1766029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.1766720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.1767437Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.1768152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.1768788Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.1769394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.1769903Z fn() 2025-05-07T20:32:48.1770410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.1770992Z self.fn.run( 2025-05-07T20:32:48.1771458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.1771979Z kernel = self.compile( 2025-05-07T20:32:48.1772517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.1773231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.1773619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1773854Z 2025-05-07T20:32:48.1774060Z self = 2025-05-07T20:32:48.1775134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.1776498Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd43880>} 2025-05-07T20:32:48.1777823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.1778825Z context = 2025-05-07T20:32:48.1779117Z 2025-05-07T20:32:48.1779282Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.1779946Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.1780413Z module_map=module_map) 2025-05-07T20:32:48.1780772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.1781128Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.1781400Z E ^ 2025-05-07T20:32:48.1781859Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.1782382Z 2025-05-07T20:32:48.1782792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.1783303Z 2025-05-07T20:32:48.1783406Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.1783821Z self=, 2025-05-07T20:32:48.1784209Z T=2048, 2025-05-07T20:32:48.1784403Z D=5120, 2025-05-07T20:32:48.1784603Z scale_ub=None, 2025-05-07T20:32:48.1784816Z contiguous=True, 2025-05-07T20:32:48.1785044Z compiled=True, 2025-05-07T20:32:48.1785301Z ) 2025-05-07T20:32:48.5388958Z self = 2025-05-07T20:32:48.5389701Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.5390037Z 2025-05-07T20:32:48.5390121Z @given( 2025-05-07T20:32:48.5390380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.5390728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.5391044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.5391397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.5391744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.5392036Z ) 2025-05-07T20:32:48.5392395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.5392865Z def test_silu_mul_quant( 2025-05-07T20:32:48.5393120Z self, 2025-05-07T20:32:48.5393325Z T: int, 2025-05-07T20:32:48.5393550Z D: int, 2025-05-07T20:32:48.5393792Z scale_ub: Optional[float], 2025-05-07T20:32:48.5394075Z contiguous: bool, 2025-05-07T20:32:48.5394337Z compiled: bool, 2025-05-07T20:32:48.5394593Z ) -> None: 2025-05-07T20:32:48.5394823Z torch.manual_seed(2025) 2025-05-07T20:32:48.5395087Z 2025-05-07T20:32:48.5395376Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.5395731Z 2025-05-07T20:32:48.5395940Z x_sign = torch.sign(x) 2025-05-07T20:32:48.5396254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.5396569Z x = x_sign * x_clamp 2025-05-07T20:32:48.5396827Z x0 = x[:, :D] 2025-05-07T20:32:48.5397057Z x1 = x[:, D:] 2025-05-07T20:32:48.5397269Z 2025-05-07T20:32:48.5397479Z if contiguous: 2025-05-07T20:32:48.5397732Z x0 = x0.contiguous() 2025-05-07T20:32:48.5398014Z x1 = x1.contiguous() 2025-05-07T20:32:48.5398362Z 2025-05-07T20:32:48.5398644Z if scale_ub is not None: 2025-05-07T20:32:48.5399035Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.5399502Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.5399919Z ) 2025-05-07T20:32:48.5400182Z else: 2025-05-07T20:32:48.5400469Z scale_ub_tensor = None 2025-05-07T20:32:48.5400857Z 2025-05-07T20:32:48.5401221Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.5401746Z op = silu_mul_quant 2025-05-07T20:32:48.5402124Z if compiled: 2025-05-07T20:32:48.5402474Z op = torch.compile(op) 2025-05-07T20:32:48.5402889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.5403279Z 2025-05-07T20:32:48.5403494Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.5403973Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.5404269Z 2025-05-07T20:32:48.5404607Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.5404957Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.5405257Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.5405574Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.5405941Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.5406265Z 2025-05-07T20:32:48.5406469Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.5406754Z 2025-05-07T20:32:48.5406861Z moe/activation_test.py:126: 2025-05-07T20:32:48.5407169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.5407509Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.5407842Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.5408639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.5409483Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.5410027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.5410740Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.5411452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.5412187Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.5412917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.5413701Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.5414306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.5414831Z fn() 2025-05-07T20:32:48.5415348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.5415934Z self.fn.run( 2025-05-07T20:32:48.5416405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.5416929Z kernel = self.compile( 2025-05-07T20:32:48.5417473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.5418137Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.5418534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.5418769Z 2025-05-07T20:32:48.5418981Z self = 2025-05-07T20:32:48.5420068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.5421441Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b6f3ba0>} 2025-05-07T20:32:48.5422770Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.5423784Z context = 2025-05-07T20:32:48.5424082Z 2025-05-07T20:32:48.5424250Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.5424771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.5425300Z module_map=module_map) 2025-05-07T20:32:48.5425705Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.5426073Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.5426353Z E ^ 2025-05-07T20:32:48.5426812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.5427263Z 2025-05-07T20:32:48.5427675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.5428231Z 2025-05-07T20:32:48.5428339Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.5428761Z self=, 2025-05-07T20:32:48.5429158Z T=128, 2025-05-07T20:32:48.5429364Z D=5120, 2025-05-07T20:32:48.5429574Z scale_ub=None, 2025-05-07T20:32:48.5429793Z contiguous=True, 2025-05-07T20:32:48.5430031Z compiled=True, 2025-05-07T20:32:48.5430246Z ) 2025-05-07T20:32:48.9620790Z self = 2025-05-07T20:32:48.9621835Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.9622122Z 2025-05-07T20:32:48.9622208Z @given( 2025-05-07T20:32:48.9622451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.9622779Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.9623086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.9623434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.9623777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.9624070Z ) 2025-05-07T20:32:48.9624428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.9624876Z def test_silu_mul_quant( 2025-05-07T20:32:48.9625123Z self, 2025-05-07T20:32:48.9625316Z T: int, 2025-05-07T20:32:48.9625536Z D: int, 2025-05-07T20:32:48.9625765Z scale_ub: Optional[float], 2025-05-07T20:32:48.9626043Z contiguous: bool, 2025-05-07T20:32:48.9626303Z compiled: bool, 2025-05-07T20:32:48.9626538Z ) -> None: 2025-05-07T20:32:48.9626753Z torch.manual_seed(2025) 2025-05-07T20:32:48.9626999Z 2025-05-07T20:32:48.9627281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.9627627Z 2025-05-07T20:32:48.9627828Z x_sign = torch.sign(x) 2025-05-07T20:32:48.9628136Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.9628445Z x = x_sign * x_clamp 2025-05-07T20:32:48.9628692Z x0 = x[:, :D] 2025-05-07T20:32:48.9628913Z x1 = x[:, D:] 2025-05-07T20:32:48.9629126Z 2025-05-07T20:32:48.9629311Z if contiguous: 2025-05-07T20:32:48.9629552Z x0 = x0.contiguous() 2025-05-07T20:32:48.9629814Z x1 = x1.contiguous() 2025-05-07T20:32:48.9630052Z 2025-05-07T20:32:48.9630244Z if scale_ub is not None: 2025-05-07T20:32:48.9630525Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.9630890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.9631231Z ) 2025-05-07T20:32:48.9631431Z else: 2025-05-07T20:32:48.9631641Z scale_ub_tensor = None 2025-05-07T20:32:48.9631894Z 2025-05-07T20:32:48.9632131Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9632446Z op = silu_mul_quant 2025-05-07T20:32:48.9632710Z if compiled: 2025-05-07T20:32:48.9632985Z op = torch.compile(op) 2025-05-07T20:32:48.9633287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.9633572Z 2025-05-07T20:32:48.9633774Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.9634059Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.9634361Z 2025-05-07T20:32:48.9634607Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.9635035Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.9635458Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.9635781Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.9636140Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.9636445Z 2025-05-07T20:32:48.9636652Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.9636845Z 2025-05-07T20:32:48.9636959Z moe/activation_test.py:126: 2025-05-07T20:32:48.9637256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9637674Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.9638005Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.9638782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.9639531Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.9640076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.9640796Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.9641524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.9642241Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.9642968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.9643604Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.9644195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.9644714Z fn() 2025-05-07T20:32:48.9645218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.9645788Z self.fn.run( 2025-05-07T20:32:48.9646265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.9646797Z kernel = self.compile( 2025-05-07T20:32:48.9647339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.9647979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.9648378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.9648603Z 2025-05-07T20:32:48.9648818Z self = 2025-05-07T20:32:48.9649893Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.9651318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a4914e0>} 2025-05-07T20:32:48.9652644Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.9653798Z context = 2025-05-07T20:32:48.9654087Z 2025-05-07T20:32:48.9654260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.9654774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.9655247Z module_map=module_map) 2025-05-07T20:32:48.9655618Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.9656035Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.9656302Z E ^ 2025-05-07T20:32:48.9656814Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.5427263Z 2025-05-07T20:32:48.5427675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.5428231Z 2025-05-07T20:32:48.5428339Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True, ) 2025-05-07T20:32:49.3884811Z self = 2025-05-07T20:32:49.3885569Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.3900509Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:49.3900818Z moe/activation_test.py:126: 2025-05-07T20:32:49.3901480Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.3901961Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.3919252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3919610Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.3919873Z E ^ 2025-05-07T20:32:49.3920394Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3921300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
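Every example above and below fails the same way: Triton refuses to lower the kernel because fp8e4nv (Triton's name for torch.float8_e4m3fn) is only implemented for NVIDIA GPUs with compute capability 8.9 or newer, while older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a guard the test could check before exercising fp8 kernels (the helper name is hypothetical, not part of the test file):

    import torch

    # Hypothetical helper: Triton only lowers fp8e4nv (torch.float8_e4m3fn)
    # on NVIDIA GPUs with compute capability >= 8.9, so probe the device
    # before exercising fp8 kernels.
    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)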
2025-05-07T20:32:49.3921905Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, ) 2025-05-07T20:32:49.4183343Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:49.4184580Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:49.4186151Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:49.4187121Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:49.4188205Z W0507 20:32:49.416000 99018 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:49.5071203Z self = 2025-05-07T20:32:49.5071748Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.5093686Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:49.5094076Z moe/activation_test.py:126: 2025-05-07T20:32:49.5094709Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.5095026Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.5112513Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.5112877Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.5113154Z E ^ 2025-05-07T20:32:49.5113622Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.5114488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:49.5115153Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, ) 2025-05-07T20:32:49.6532855Z self = 2025-05-07T20:32:49.6533452Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:49.6545918Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.6546197Z moe/activation_test.py:117: 2025-05-07T20:32:49.6546913Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.6547202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.6547763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.6548322Z return fn(*args, **kwargs) 2025-05-07T20:32:49.6548979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.6549672Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.6561125Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.6561501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.6561776Z E ^ 2025-05-07T20:32:49.6562249Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.6563111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
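The recompile_limit warning above is a side effect of the property-based loop rather than a separate bug: each (shape, stride) combination drawn by @given trips a new dynamo guard (the last reason is a stride mismatch between the contiguous and sliced variants of x0), and after 8 recompiles dynamo falls back to eager. A hedged sketch of one way to compile once across layouts, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch
    # Assumed import, taken from the traceback path
    # .../fbgemm_gpu/experimental/gen_ai/moe/activation.py
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # dynamic=True asks torch.compile to keep sizes and strides symbolic
    # instead of specializing (and recompiling) on every new input layout
    # drawn by the property-based test.
    op = torch.compile(silu_mul_quant, dynamic=True)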
2025-05-07T20:32:49.6563737Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, ) 2025-05-07T20:32:49.8720448Z self = 2025-05-07T20:32:49.8721355Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.8736045Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:49.8736342Z moe/activation_test.py:126: 2025-05-07T20:32:49.8736980Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.8737310Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.8754908Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.8755273Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.8755551Z E ^ 2025-05-07T20:32:49.8756013Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.8756873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
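For reference, the quantization step ref_fn delegates to, triton_quantize_fp8_row, performs row-wise fp8 quantization. Below is an assumed plain-PyTorch equivalent consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]); it is an illustration of the semantics, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    # Hedged sketch: scale each row so its max |value| (optionally clamped
    # to scale_ub) maps onto the float8_e4m3fn range; the returned scale is
    # the per-row dequantization factor.
    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale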
2025-05-07T20:32:49.8757487Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, ) 2025-05-07T20:32:50.0275177Z self = 2025-05-07T20:32:50.0276146Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:50.0288212Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0288474Z moe/activation_test.py:117: 2025-05-07T20:32:50.0289101Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0289391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0290073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0290756Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0300670Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0302448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0302714Z E ^ 2025-05-07T20:32:50.0303180Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0304051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
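Both call paths die in kernel compilation (the compiled and eager fn() paths in _fbgemm_silu_mul_quant, the ref_fn() path in _kernel_quantize_fp8_row), so on this GPU the test can never reach its assertions. A hedged sketch of a suite-level guard (marker name hypothetical) that would skip it up front instead of failing once per drawn example:

    import pytest
    import torch

    # Hypothetical marker: skip fp8 tests outright where Triton cannot
    # compile fp8e4nv, rather than failing on every hypothesis example.
    requires_fp8e4nv = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )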
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0303628Z 2025-05-07T20:32:50.0304051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0304554Z 2025-05-07T20:32:50.0304669Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0305077Z self=, 2025-05-07T20:32:50.0305482Z T=128, 2025-05-07T20:32:50.0305677Z D=5120, 2025-05-07T20:32:50.0305882Z scale_ub=None, 2025-05-07T20:32:50.0306105Z contiguous=False, 2025-05-07T20:32:50.0306341Z compiled=True, 2025-05-07T20:32:50.0306545Z ) 2025-05-07T20:32:50.0306868Z self = 2025-05-07T20:32:50.0307365Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:50.0307633Z 2025-05-07T20:32:50.0307716Z @given( 2025-05-07T20:32:50.0307956Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.0308275Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.0308591Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.0308914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.0309240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.0309524Z ) 2025-05-07T20:32:50.0309869Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.0310309Z def test_silu_mul_quant( 2025-05-07T20:32:50.0310561Z self, 2025-05-07T20:32:50.0310759Z T: int, 2025-05-07T20:32:50.0310961Z D: int, 2025-05-07T20:32:50.0311199Z scale_ub: Optional[float], 2025-05-07T20:32:50.0311505Z contiguous: bool, 2025-05-07T20:32:50.0311753Z compiled: bool, 2025-05-07T20:32:50.0311979Z ) -> None: 2025-05-07T20:32:50.0312190Z torch.manual_seed(2025) 2025-05-07T20:32:50.0312439Z 2025-05-07T20:32:50.0312716Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.0313054Z 2025-05-07T20:32:50.0313244Z x_sign = torch.sign(x) 2025-05-07T20:32:50.0313537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.0313852Z x = x_sign * x_clamp 2025-05-07T20:32:50.0314094Z x0 = x[:, :D] 2025-05-07T20:32:50.0314372Z x1 = x[:, D:] 2025-05-07T20:32:50.0314584Z 2025-05-07T20:32:50.0314776Z if contiguous: 2025-05-07T20:32:50.0315014Z x0 = x0.contiguous() 2025-05-07T20:32:50.0315353Z x1 = x1.contiguous() 2025-05-07T20:32:50.0315596Z 2025-05-07T20:32:50.0315796Z if scale_ub is not None: 2025-05-07T20:32:50.0316074Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.0316403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.0323982Z ) 2025-05-07T20:32:50.0324203Z else: 2025-05-07T20:32:50.0324416Z scale_ub_tensor = None 2025-05-07T20:32:50.0324768Z 2025-05-07T20:32:50.0325013Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.0325330Z op = silu_mul_quant 2025-05-07T20:32:50.0325590Z if compiled: 2025-05-07T20:32:50.0325848Z op = torch.compile(op) 2025-05-07T20:32:50.0326142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0326426Z 2025-05-07T20:32:50.0326627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.0326796Z 2025-05-07T20:32:50.0326906Z moe/activation_test.py:117: 2025-05-07T20:32:50.0327254Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0327591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.0327879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.0328427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.0328983Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.0329640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.0330319Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.0330843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.0331515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.0332184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.0332703Z kernel = self.compile( 2025-05-07T20:32:50.0333337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.0333989Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.0334387Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.0334615Z 2025-05-07T20:32:50.0334820Z self = 2025-05-07T20:32:50.0335889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.0337249Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a9db1a0>} 2025-05-07T20:32:50.0338570Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.0339576Z context = 2025-05-07T20:32:50.0339858Z 2025-05-07T20:32:50.0340026Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.0340543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.0341010Z module_map=module_map) 2025-05-07T20:32:50.0341415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.0341773Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.0342085Z E ^ 2025-05-07T20:32:50.0342587Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.0343029Z 2025-05-07T20:32:50.0343440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.0343960Z 2025-05-07T20:32:50.0344065Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.0344476Z self=, 2025-05-07T20:32:50.0344906Z T=128, 2025-05-07T20:32:50.0345099Z D=7168, 2025-05-07T20:32:50.0345296Z scale_ub=1200.0, 2025-05-07T20:32:50.0345531Z contiguous=False, 2025-05-07T20:32:50.0345753Z compiled=False, 2025-05-07T20:32:50.0345964Z ) 2025-05-07T20:32:50.1487762Z self = 2025-05-07T20:32:50.1488425Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.1488773Z 2025-05-07T20:32:50.1488856Z @given( 2025-05-07T20:32:50.1489107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1489729Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1490051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1490391Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1490714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1491005Z ) 2025-05-07T20:32:50.1491370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1491871Z def test_silu_mul_quant( 2025-05-07T20:32:50.1492120Z self, 2025-05-07T20:32:50.1492322Z T: int, 2025-05-07T20:32:50.1492532Z D: int, 2025-05-07T20:32:50.1492752Z scale_ub: Optional[float], 2025-05-07T20:32:50.1493106Z contiguous: bool, 2025-05-07T20:32:50.1493351Z compiled: bool, 2025-05-07T20:32:50.1493588Z ) -> None: 2025-05-07T20:32:50.1493809Z torch.manual_seed(2025) 2025-05-07T20:32:50.1494057Z 2025-05-07T20:32:50.1494340Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1494694Z 2025-05-07T20:32:50.1494902Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1495195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1495511Z x = x_sign * x_clamp 2025-05-07T20:32:50.1495764Z x0 = x[:, :D] 2025-05-07T20:32:50.1495980Z x1 = x[:, D:] 2025-05-07T20:32:50.1496202Z 2025-05-07T20:32:50.1496401Z if contiguous: 2025-05-07T20:32:50.1496638Z x0 = x0.contiguous() 2025-05-07T20:32:50.1496911Z x1 = x1.contiguous() 2025-05-07T20:32:50.1497160Z 2025-05-07T20:32:50.1497357Z if scale_ub is not None: 2025-05-07T20:32:50.1497643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1497986Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1498306Z ) 2025-05-07T20:32:50.1498501Z else: 2025-05-07T20:32:50.1498722Z scale_ub_tensor = None 2025-05-07T20:32:50.1498995Z 2025-05-07T20:32:50.1499234Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1499595Z op = silu_mul_quant 2025-05-07T20:32:50.1499845Z if compiled: 2025-05-07T20:32:50.1500102Z op = torch.compile(op) 2025-05-07T20:32:50.1500402Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1500680Z 2025-05-07T20:32:50.1500883Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1501050Z 2025-05-07T20:32:50.1501158Z moe/activation_test.py:117: 2025-05-07T20:32:50.1501453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1501791Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1502080Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1502765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1503547Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1504168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1504850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1505507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1506037Z kernel = self.compile( 2025-05-07T20:32:50.1506657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1507311Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1507712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1507947Z 2025-05-07T20:32:50.1508158Z self = 2025-05-07T20:32:50.1509285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1510653Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b3c80e0>} 2025-05-07T20:32:50.1512237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1513510Z context = 2025-05-07T20:32:50.1513813Z 2025-05-07T20:32:50.1513983Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1514508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1514978Z module_map=module_map) 2025-05-07T20:32:50.1515357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1515716Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1515979Z E ^ 2025-05-07T20:32:50.1516449Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1516901Z 2025-05-07T20:32:50.1517320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1517827Z 2025-05-07T20:32:50.1517940Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1518348Z self=, 2025-05-07T20:32:50.1518759Z T=128, 2025-05-07T20:32:50.1518956Z D=5120, 2025-05-07T20:32:50.1519161Z scale_ub=None, 2025-05-07T20:32:50.1519375Z contiguous=False, 2025-05-07T20:32:50.1519616Z compiled=False, 2025-05-07T20:32:50.1519844Z ) 2025-05-07T20:32:50.1520171Z self = 2025-05-07T20:32:50.1520672Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.1521005Z 2025-05-07T20:32:50.1521115Z @given( 2025-05-07T20:32:50.1521403Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.1521804Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.1522203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.1522613Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.1523006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.1523302Z ) 2025-05-07T20:32:50.1523660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.1524148Z def test_silu_mul_quant( 2025-05-07T20:32:50.1524397Z self, 2025-05-07T20:32:50.1524602Z T: int, 2025-05-07T20:32:50.1524799Z D: int, 2025-05-07T20:32:50.1525069Z scale_ub: Optional[float], 2025-05-07T20:32:50.1525356Z contiguous: bool, 2025-05-07T20:32:50.1525599Z compiled: bool, 2025-05-07T20:32:50.1525830Z ) -> None: 2025-05-07T20:32:50.1526054Z torch.manual_seed(2025) 2025-05-07T20:32:50.1526293Z 2025-05-07T20:32:50.1526578Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.1526922Z 2025-05-07T20:32:50.1527162Z x_sign = torch.sign(x) 2025-05-07T20:32:50.1527455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.1527770Z x = x_sign * x_clamp 2025-05-07T20:32:50.1528013Z x0 = x[:, :D] 2025-05-07T20:32:50.1528239Z x1 = x[:, D:] 2025-05-07T20:32:50.1528458Z 2025-05-07T20:32:50.1528651Z if contiguous: 2025-05-07T20:32:50.1528895Z x0 = x0.contiguous() 2025-05-07T20:32:50.1529163Z x1 = x1.contiguous() 2025-05-07T20:32:50.1529402Z 2025-05-07T20:32:50.1529605Z if scale_ub is not None: 2025-05-07T20:32:50.1529935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.1530269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.1530587Z ) 2025-05-07T20:32:50.1530791Z else: 2025-05-07T20:32:50.1531013Z scale_ub_tensor = None 2025-05-07T20:32:50.1531264Z 2025-05-07T20:32:50.1531498Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.1531817Z op = silu_mul_quant 2025-05-07T20:32:50.1532064Z if compiled: 2025-05-07T20:32:50.1532321Z op = torch.compile(op) 2025-05-07T20:32:50.1532620Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1532892Z 2025-05-07T20:32:50.1533154Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.1533321Z 2025-05-07T20:32:50.1533429Z moe/activation_test.py:117: 2025-05-07T20:32:50.1533721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1534058Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.1534344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.1535032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.1535711Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.1536265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.1536949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.1537606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.1538142Z kernel = self.compile( 2025-05-07T20:32:50.1538690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.1539338Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.1539745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.1539980Z 2025-05-07T20:32:50.1540187Z self = 2025-05-07T20:32:50.1541284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.1542665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b3bea20>} 2025-05-07T20:32:50.1543981Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.1545126Z context = 2025-05-07T20:32:50.1545421Z 2025-05-07T20:32:50.1545591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.1546111Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.1546572Z module_map=module_map) 2025-05-07T20:32:50.1546941Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.1547341Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.1547601Z E ^ 2025-05-07T20:32:50.1548066Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.1548518Z 2025-05-07T20:32:50.1548930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.1549437Z 2025-05-07T20:32:50.1549548Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.1550002Z self=, 2025-05-07T20:32:50.1550407Z T=128, 2025-05-07T20:32:50.1550602Z D=5120, 2025-05-07T20:32:50.1550796Z scale_ub=1200.0, 2025-05-07T20:32:50.1551022Z contiguous=True, 2025-05-07T20:32:50.1551262Z compiled=False, 2025-05-07T20:32:50.1551501Z ) 2025-05-07T20:32:50.3288612Z self = 2025-05-07T20:32:50.3290075Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.3290830Z 2025-05-07T20:32:50.3291053Z @given( 2025-05-07T20:32:50.3291534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.3291893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.3292207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.3292545Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.3292875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.3293251Z ) 2025-05-07T20:32:50.3293606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.3294047Z def test_silu_mul_quant( 2025-05-07T20:32:50.3294288Z self, 2025-05-07T20:32:50.3294495Z T: int, 2025-05-07T20:32:50.3294698Z D: int, 2025-05-07T20:32:50.3294916Z scale_ub: Optional[float], 2025-05-07T20:32:50.3295203Z contiguous: bool, 2025-05-07T20:32:50.3295453Z compiled: bool, 2025-05-07T20:32:50.3295687Z ) -> None: 2025-05-07T20:32:50.3295907Z torch.manual_seed(2025) 2025-05-07T20:32:50.3296156Z 2025-05-07T20:32:50.3296441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.3296782Z 2025-05-07T20:32:50.3296987Z x_sign = torch.sign(x) 2025-05-07T20:32:50.3297287Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.3297602Z x = x_sign * x_clamp 2025-05-07T20:32:50.3297860Z x0 = x[:, :D] 2025-05-07T20:32:50.3298094Z x1 = x[:, D:] 2025-05-07T20:32:50.3298302Z 2025-05-07T20:32:50.3298507Z if contiguous: 2025-05-07T20:32:50.3298752Z x0 = x0.contiguous() 2025-05-07T20:32:50.3299015Z x1 = x1.contiguous() 2025-05-07T20:32:50.3299270Z 2025-05-07T20:32:50.3299473Z if scale_ub is not None: 2025-05-07T20:32:50.3299751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.3300096Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.3300410Z ) 2025-05-07T20:32:50.3300609Z else: 2025-05-07T20:32:50.3300829Z scale_ub_tensor = None 2025-05-07T20:32:50.3301092Z 2025-05-07T20:32:50.3301333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.3301644Z op = silu_mul_quant 2025-05-07T20:32:50.3302182Z if compiled: 2025-05-07T20:32:50.3302440Z op = torch.compile(op) 2025-05-07T20:32:50.3302814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3303092Z 2025-05-07T20:32:50.3303300Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.3303466Z 2025-05-07T20:32:50.3303565Z moe/activation_test.py:117: 2025-05-07T20:32:50.3303871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3304205Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.3304485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3305252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.3305937Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.3306474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.3307149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.3307873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.3308409Z kernel = self.compile( 2025-05-07T20:32:50.3308952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.3309600Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.3309997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3310227Z 2025-05-07T20:32:50.3310440Z self = 2025-05-07T20:32:50.3311552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.3312937Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79fb39120>} 2025-05-07T20:32:50.3314262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.3315277Z context = 2025-05-07T20:32:50.3315563Z 2025-05-07T20:32:50.3315740Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.3316257Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.3316732Z module_map=module_map) 2025-05-07T20:32:50.3317102Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.3317456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.3317732Z E ^ 2025-05-07T20:32:50.3318207Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.3318652Z 2025-05-07T20:32:50.3319071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.3319574Z 2025-05-07T20:32:50.3319680Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.3320098Z self=, 2025-05-07T20:32:50.3320512Z T=1, 2025-05-07T20:32:50.3320711Z D=7168, 2025-05-07T20:32:50.3320905Z scale_ub=1200.0, 2025-05-07T20:32:50.3321135Z contiguous=True, 2025-05-07T20:32:50.3321369Z compiled=True, 2025-05-07T20:32:50.3321579Z ) 2025-05-07T20:32:50.3321906Z self = 2025-05-07T20:32:50.3322398Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:50.3322709Z 2025-05-07T20:32:50.3322791Z @given( 2025-05-07T20:32:50.3323075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.3323395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.3323706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.3324041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.3324374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.3324665Z ) 2025-05-07T20:32:50.3325010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.3325497Z def test_silu_mul_quant( 2025-05-07T20:32:50.3325744Z self, 2025-05-07T20:32:50.3325939Z T: int, 2025-05-07T20:32:50.3326145Z D: int, 2025-05-07T20:32:50.3326366Z scale_ub: Optional[float], 2025-05-07T20:32:50.3326640Z contiguous: bool, 2025-05-07T20:32:50.3326889Z compiled: bool, 2025-05-07T20:32:50.3327127Z ) -> None: 2025-05-07T20:32:50.3327344Z torch.manual_seed(2025) 2025-05-07T20:32:50.3327594Z 2025-05-07T20:32:50.3327915Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.3328255Z 2025-05-07T20:32:50.3328458Z x_sign = torch.sign(x) 2025-05-07T20:32:50.3328756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.3329075Z x = x_sign * x_clamp 2025-05-07T20:32:50.3329330Z x0 = x[:, :D] 2025-05-07T20:32:50.3329556Z x1 = x[:, D:] 2025-05-07T20:32:50.3329779Z 2025-05-07T20:32:50.3329982Z if contiguous: 2025-05-07T20:32:50.3330222Z x0 = x0.contiguous() 2025-05-07T20:32:50.3330491Z x1 = x1.contiguous() 2025-05-07T20:32:50.3330739Z 2025-05-07T20:32:50.3330946Z if scale_ub is not None: 2025-05-07T20:32:50.3331236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.3331572Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.3331893Z ) 2025-05-07T20:32:50.3332105Z else: 2025-05-07T20:32:50.3332324Z scale_ub_tensor = None 2025-05-07T20:32:50.3332591Z 2025-05-07T20:32:50.3332841Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.3333242Z op = silu_mul_quant 2025-05-07T20:32:50.3333507Z if compiled: 2025-05-07T20:32:50.3333769Z op = torch.compile(op) 2025-05-07T20:32:50.3334064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3334348Z 2025-05-07T20:32:50.3334565Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.3334732Z 2025-05-07T20:32:50.3334841Z moe/activation_test.py:117: 2025-05-07T20:32:50.3335139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.3335474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.3335767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.3336325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.3336894Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.3337569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:50.3338258Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:50.3338798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.3339480Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.3340152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.3340678Z     kernel = self.compile(
2025-05-07T20:32:50.3341237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.3341930Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.3342380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:50.3342645Z 
2025-05-07T20:32:50.3342857Z self = 
2025-05-07T20:32:50.3343926Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:50.3345279Z codegen_fns = {'convert_custom_types': , 'min_dot_size': }
2025-05-07T20:32:50.3346645Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:50.3347657Z context = 
2025-05-07T20:32:50.3347952Z 
2025-05-07T20:32:50.3348199Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:50.3348724Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:50.3349197Z                            module_map=module_map)
2025-05-07T20:32:50.3349563Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.3349926Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:50.3350199Z E   ^
2025-05-07T20:32:50.3350668Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.3351123Z 
2025-05-07T20:32:50.3351571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
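
The root cause is environmental rather than a bug in the kernel: Triton's fp8e4nv type is the FP8 E4M3 format these kernels emit, and Triton only lowers it natively on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose A10G reports SM 8.6, so every kernel touching fp8e4nv fails at compile time with the ValueError above. A capability probe of roughly this shape (a sketch; the helper name is ours, not FBGEMM's) is the usual way to gate such tests:

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Triton rejects fp8e4nv (FP8 E4M3) below SM 8.9; the A10G on this
        # runner is SM 8.6, which is exactly the failure in this log.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)
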
[hypothesis retries below hit the identical CompilationError; the duplicated test source and tracebacks are elided, keeping each tried example and its error]

2025-05-07T20:32:50.3352212Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:50.4708278Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.4709761Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

This example gets past fn(); the same ValueError then surfaces in the reference path via triton_quantize_fp8_row:

2025-05-07T20:32:50.5599045Z         y_fp8, y_scale = fn()
2025-05-07T20:32:50.5599338Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:50.5599637Z 
2025-05-07T20:32:50.5599891Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:50.5600227Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:50.5600533Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:50.5600948Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:50.5601386Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.5601712Z 
2025-05-07T20:32:50.5601926Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:50.5602121Z 
2025-05-07T20:32:50.5602224Z moe/activation_test.py:126: 
2025-05-07T20:32:50.5602532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:50.5602887Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:50.5603219Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:50.5604083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:50.5604837Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:50.5605378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:50.5606061Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:50.5606794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:50.5607512Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:50.5608232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:50.5608869Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:50.5609477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:50.5609997Z     fn()
2025-05-07T20:32:50.5610501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:50.5611083Z     self.fn.run(
2025-05-07T20:32:50.5611560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:50.5612085Z     kernel = self.compile(
2025-05-07T20:32:50.5612629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:50.5613377Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:50.5620798Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:50.5621164Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:50.5621468Z E   ^
2025-05-07T20:32:50.5621956Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.5622928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
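
Both kernels fail for the same reason: triton_quantize_fp8_row also emits fp8e4nv stores. The row-wise quantization the test expects can be approximated in plain PyTorch on any device; this sketch infers the contract from how the test consumes the result (y_fp8.to(torch.float32) * y_scale[:, None]) and is not FBGEMM's implementation:

    import torch

    def rowwise_quantize_fp8_reference(y, scale_ub=None):
        # Per-row max-abs scaling onto the FP8 E4M3 range, optionally clamped
        # by scale_ub (a 1-element float32 tensor), mirroring the test's usage.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # per-row dequant scale
        y_fp8 = (y.to(torch.float32) / scale[:, None]).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), scale
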
2025-05-07T20:32:50.5623557Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:50.7203433Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
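
Because the failure is architectural, hypothesis keeps trying and shrinking examples, and every draw dies in the same compile step, which is why the log repeats below. Skipping up front keeps the signal clean; a guard along these lines (our naming, not the test file's) would do it:

    import unittest
    import torch

    def _compute_capability():
        # (0, 0) on CPU-only hosts so the same guard covers missing GPUs.
        return torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)

    # Applied to test_silu_mul_quant (or its TestCase), this turns the
    # repeated CompilationErrors below into a single skip.
    skip_unless_fp8e4nv = unittest.skipIf(
        _compute_capability() < (8, 9),
        "Triton fp8e4nv (FP8 E4M3) requires SM 8.9+ (Ada/Hopper)",
    )
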
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.7203884Z 2025-05-07T20:32:50.7204293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7204799Z 2025-05-07T20:32:50.7204902Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7205310Z self=, 2025-05-07T20:32:50.7205703Z T=1, 2025-05-07T20:32:50.7205891Z D=5120, 2025-05-07T20:32:50.7206089Z scale_ub=1200.0, 2025-05-07T20:32:50.7206313Z contiguous=False, 2025-05-07T20:32:50.7206543Z compiled=False, 2025-05-07T20:32:50.7206754Z ) 2025-05-07T20:32:50.7207077Z self = 2025-05-07T20:32:50.7207553Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.7207821Z 2025-05-07T20:32:50.7207898Z @given( 2025-05-07T20:32:50.7208128Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.7208439Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.7208753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.7209085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.7209410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.7209698Z ) 2025-05-07T20:32:50.7210052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.7210545Z def test_silu_mul_quant( 2025-05-07T20:32:50.7210827Z self, 2025-05-07T20:32:50.7211034Z T: int, 2025-05-07T20:32:50.7211235Z D: int, 2025-05-07T20:32:50.7211455Z scale_ub: Optional[float], 2025-05-07T20:32:50.7211733Z contiguous: bool, 2025-05-07T20:32:50.7211976Z compiled: bool, 2025-05-07T20:32:50.7212196Z ) -> None: 2025-05-07T20:32:50.7212423Z torch.manual_seed(2025) 2025-05-07T20:32:50.7212671Z 2025-05-07T20:32:50.7213115Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.7213463Z 2025-05-07T20:32:50.7213661Z x_sign = torch.sign(x) 2025-05-07T20:32:50.7213952Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.7214268Z x = x_sign * x_clamp 2025-05-07T20:32:50.7214517Z x0 = x[:, :D] 2025-05-07T20:32:50.7214728Z x1 = x[:, D:] 2025-05-07T20:32:50.7214942Z 2025-05-07T20:32:50.7215134Z if contiguous: 2025-05-07T20:32:50.7215365Z x0 = x0.contiguous() 2025-05-07T20:32:50.7215679Z x1 = x1.contiguous() 2025-05-07T20:32:50.7215926Z 2025-05-07T20:32:50.7216122Z if scale_ub is not None: 2025-05-07T20:32:50.7216394Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.7216736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.7217045Z ) 2025-05-07T20:32:50.7217242Z else: 2025-05-07T20:32:50.7217462Z scale_ub_tensor = None 2025-05-07T20:32:50.7217725Z 2025-05-07T20:32:50.7217957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.7218277Z op = silu_mul_quant 2025-05-07T20:32:50.7218533Z if compiled: 2025-05-07T20:32:50.7218785Z op = torch.compile(op) 2025-05-07T20:32:50.7219092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.7219374Z 2025-05-07T20:32:50.7219572Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.7219748Z 2025-05-07T20:32:50.7219849Z moe/activation_test.py:117: 2025-05-07T20:32:50.7220155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.7220491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.7220777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.7221470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.7222159Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.7222700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.7223383Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.7224057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.7224599Z kernel = self.compile( 2025-05-07T20:32:50.7225137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.7225796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.7226200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.7226425Z 2025-05-07T20:32:50.7226640Z self = 2025-05-07T20:32:50.7227698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.7229050Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f8ceac0>} 2025-05-07T20:32:50.7230463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.7231509Z context = 2025-05-07T20:32:50.7231818Z 2025-05-07T20:32:50.7231981Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.7232504Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.7233021Z module_map=module_map) 2025-05-07T20:32:50.7233388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.7233743Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.7234005Z E ^ 2025-05-07T20:32:50.7234472Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.7234919Z 2025-05-07T20:32:50.7235332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7235848Z 2025-05-07T20:32:50.7235998Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.7236421Z self=, 2025-05-07T20:32:50.7236817Z T=16384, 2025-05-07T20:32:50.7237008Z D=5120, 2025-05-07T20:32:50.7237207Z scale_ub=1200.0, 2025-05-07T20:32:50.7237435Z contiguous=False, 2025-05-07T20:32:50.7237656Z compiled=True, 2025-05-07T20:32:50.7237863Z ) 2025-05-07T20:32:50.8114303Z self = 2025-05-07T20:32:50.8115056Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.8115432Z 2025-05-07T20:32:50.8115543Z @given( 2025-05-07T20:32:50.8115841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8116273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8116670Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8117071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8117402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8117693Z ) 2025-05-07T20:32:50.8118047Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8118487Z def test_silu_mul_quant( 2025-05-07T20:32:50.8118734Z self, 2025-05-07T20:32:50.8118942Z T: int, 2025-05-07T20:32:50.8119148Z D: int, 2025-05-07T20:32:50.8119380Z scale_ub: Optional[float], 2025-05-07T20:32:50.8119680Z contiguous: bool, 2025-05-07T20:32:50.8119932Z compiled: bool, 2025-05-07T20:32:50.8120168Z ) -> None: 2025-05-07T20:32:50.8120392Z torch.manual_seed(2025) 2025-05-07T20:32:50.8128533Z 2025-05-07T20:32:50.8128822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8129167Z 2025-05-07T20:32:50.8129365Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8129659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8129977Z x = x_sign * x_clamp 2025-05-07T20:32:50.8130218Z x0 = x[:, :D] 2025-05-07T20:32:50.8130419Z x1 = x[:, D:] 2025-05-07T20:32:50.8130629Z 2025-05-07T20:32:50.8130816Z if contiguous: 2025-05-07T20:32:50.8131049Z x0 = x0.contiguous() 2025-05-07T20:32:50.8131311Z x1 = x1.contiguous() 2025-05-07T20:32:50.8131559Z 2025-05-07T20:32:50.8131754Z if scale_ub is not None: 2025-05-07T20:32:50.8132036Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8132382Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8132694Z ) 2025-05-07T20:32:50.8132890Z else: 2025-05-07T20:32:50.8133171Z scale_ub_tensor = None 2025-05-07T20:32:50.8133427Z 2025-05-07T20:32:50.8133925Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8134249Z op = silu_mul_quant 2025-05-07T20:32:50.8134504Z if compiled: 2025-05-07T20:32:50.8134831Z op = torch.compile(op) 2025-05-07T20:32:50.8135137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8135413Z 2025-05-07T20:32:50.8135604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8135783Z 2025-05-07T20:32:50.8135882Z moe/activation_test.py:117: 2025-05-07T20:32:50.8136181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8136596Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8136877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8137431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.8137994Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.8138635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.8139319Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8139927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8140599Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8141247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8141819Z kernel = self.compile( 2025-05-07T20:32:50.8142364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8143001Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8143397Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8143633Z 2025-05-07T20:32:50.8143838Z self = 2025-05-07T20:32:50.8144914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8146270Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4c180>} 2025-05-07T20:32:50.8147598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8148610Z context = 2025-05-07T20:32:50.8148894Z 2025-05-07T20:32:50.8149065Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8149585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8150043Z module_map=module_map) 2025-05-07T20:32:50.8150413Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8150771Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8151029Z E ^ 2025-05-07T20:32:50.8151497Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8151938Z 2025-05-07T20:32:50.8152354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8152860Z 2025-05-07T20:32:50.8152973Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8153377Z self=, 2025-05-07T20:32:50.8153777Z T=2048, 2025-05-07T20:32:50.8153969Z D=7168, 2025-05-07T20:32:50.8154213Z scale_ub=1200.0, 2025-05-07T20:32:50.8154441Z contiguous=False, 2025-05-07T20:32:50.8154668Z compiled=True, 2025-05-07T20:32:50.8154870Z ) 2025-05-07T20:32:50.8155236Z self = 2025-05-07T20:32:50.8155732Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:50.8156000Z 2025-05-07T20:32:50.8156088Z @given( 2025-05-07T20:32:50.8156317Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8156631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8157000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8157325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8157654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8157938Z ) 2025-05-07T20:32:50.8158278Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8158720Z def test_silu_mul_quant( 2025-05-07T20:32:50.8158962Z self, 2025-05-07T20:32:50.8159158Z T: int, 2025-05-07T20:32:50.8159758Z D: int, 2025-05-07T20:32:50.8160051Z scale_ub: Optional[float], 2025-05-07T20:32:50.8160322Z contiguous: bool, 2025-05-07T20:32:50.8160568Z compiled: bool, 2025-05-07T20:32:50.8160797Z ) -> None: 2025-05-07T20:32:50.8161012Z torch.manual_seed(2025) 2025-05-07T20:32:50.8161252Z 2025-05-07T20:32:50.8161526Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8161872Z 2025-05-07T20:32:50.8162069Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8162361Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8162676Z x = x_sign * x_clamp 2025-05-07T20:32:50.8162920Z x0 = x[:, :D] 2025-05-07T20:32:50.8163144Z x1 = x[:, D:] 2025-05-07T20:32:50.8163352Z 2025-05-07T20:32:50.8163543Z if contiguous: 2025-05-07T20:32:50.8163783Z x0 = x0.contiguous() 2025-05-07T20:32:50.8164051Z x1 = x1.contiguous() 2025-05-07T20:32:50.8164294Z 2025-05-07T20:32:50.8164500Z if scale_ub is not None: 2025-05-07T20:32:50.8164785Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8165115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8165431Z ) 2025-05-07T20:32:50.8165628Z else: 2025-05-07T20:32:50.8165836Z scale_ub_tensor = None 2025-05-07T20:32:50.8166093Z 2025-05-07T20:32:50.8166331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8166651Z op = silu_mul_quant 2025-05-07T20:32:50.8166897Z if compiled: 2025-05-07T20:32:50.8167150Z op = torch.compile(op) 2025-05-07T20:32:50.8167454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8167724Z 2025-05-07T20:32:50.8167924Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8168090Z 2025-05-07T20:32:50.8168198Z moe/activation_test.py:117: 2025-05-07T20:32:50.8168494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8168838Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8169131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8169686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:50.8170250Z return fn(*args, **kwargs) 
2025-05-07T20:32:50.8170907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.8171614Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8172183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8172862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8173607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8174204Z kernel = self.compile( 2025-05-07T20:32:50.8174794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8175443Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8175842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8176068Z 2025-05-07T20:32:50.8176274Z self = 2025-05-07T20:32:50.8177411Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8178767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4cea0>} 2025-05-07T20:32:50.8180156Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8181170Z context = 2025-05-07T20:32:50.8181460Z 2025-05-07T20:32:50.8181628Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8182191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8182658Z module_map=module_map) 2025-05-07T20:32:50.8183029Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8183380Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8183646Z E ^ 2025-05-07T20:32:50.8184109Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8184556Z 2025-05-07T20:32:50.8184980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8185492Z 2025-05-07T20:32:50.9337691Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9338409Z self=, 2025-05-07T20:32:50.9338954Z T=1, 2025-05-07T20:32:50.9339212Z D=5120, 2025-05-07T20:32:50.9339413Z scale_ub=None, 2025-05-07T20:32:50.9339635Z contiguous=False, 2025-05-07T20:32:50.9339899Z compiled=False, 2025-05-07T20:32:50.9340113Z ) 2025-05-07T20:32:50.9340428Z self = 2025-05-07T20:32:50.9340920Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:50.9341179Z 2025-05-07T20:32:50.9341265Z @given( 2025-05-07T20:32:50.9341497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9341827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9342139Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9342477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9342803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9343101Z ) 2025-05-07T20:32:50.9343455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9343906Z def test_silu_mul_quant( 2025-05-07T20:32:50.9344152Z self, 2025-05-07T20:32:50.9344355Z T: int, 2025-05-07T20:32:50.9344561Z D: int, 2025-05-07T20:32:50.9344781Z scale_ub: Optional[float], 2025-05-07T20:32:50.9345052Z contiguous: bool, 2025-05-07T20:32:50.9345296Z compiled: bool, 2025-05-07T20:32:50.9345520Z ) -> None: 2025-05-07T20:32:50.9345746Z torch.manual_seed(2025) 2025-05-07T20:32:50.9345993Z 2025-05-07T20:32:50.9346520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9346866Z 2025-05-07T20:32:50.9347065Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9347444Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9347756Z x = x_sign * x_clamp 2025-05-07T20:32:50.9348001Z x0 = x[:, :D] 2025-05-07T20:32:50.9348218Z x1 = x[:, D:] 2025-05-07T20:32:50.9348432Z 2025-05-07T20:32:50.9348622Z if contiguous: 2025-05-07T20:32:50.9348853Z x0 = x0.contiguous() 2025-05-07T20:32:50.9349114Z x1 = x1.contiguous() 2025-05-07T20:32:50.9349467Z 2025-05-07T20:32:50.9349656Z if scale_ub is not None: 2025-05-07T20:32:50.9349929Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9350267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9350578Z ) 2025-05-07T20:32:50.9350773Z else: 2025-05-07T20:32:50.9350991Z scale_ub_tensor = None 2025-05-07T20:32:50.9351246Z 2025-05-07T20:32:50.9351472Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9351788Z op = silu_mul_quant 2025-05-07T20:32:50.9352131Z if compiled: 2025-05-07T20:32:50.9352382Z op = torch.compile(op) 2025-05-07T20:32:50.9352688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9352964Z 2025-05-07T20:32:50.9353164Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9353338Z 2025-05-07T20:32:50.9353439Z moe/activation_test.py:117: 2025-05-07T20:32:50.9353741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9354075Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9354362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9355049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.9355731Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9356268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9356950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9357609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9358137Z kernel = self.compile( 2025-05-07T20:32:50.9358680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9359629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9360030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9360256Z 2025-05-07T20:32:50.9360465Z self = 2025-05-07T20:32:50.9361532Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9362900Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4de40>} 2025-05-07T20:32:50.9364218Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9365230Z context = 2025-05-07T20:32:50.9365512Z 2025-05-07T20:32:50.9365678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9366193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9366733Z module_map=module_map) 2025-05-07T20:32:50.9367097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9367513Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9367783Z E ^ 2025-05-07T20:32:50.9368253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9368700Z 2025-05-07T20:32:50.9369113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9369624Z 2025-05-07T20:32:50.9369792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9370209Z self=, 2025-05-07T20:32:50.9370603Z T=4096, 2025-05-07T20:32:50.9370800Z D=7168, 2025-05-07T20:32:50.9370998Z scale_ub=1200.0, 2025-05-07T20:32:50.9371231Z contiguous=False, 2025-05-07T20:32:50.9371456Z compiled=False, 2025-05-07T20:32:50.9371669Z ) 2025-05-07T20:32:50.9371995Z self = 2025-05-07T20:32:50.9372548Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.9372829Z 2025-05-07T20:32:50.9372910Z @given( 2025-05-07T20:32:50.9373235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.9373547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.9373859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.9374190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.9374517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.9374805Z ) 2025-05-07T20:32:50.9375158Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.9375600Z def test_silu_mul_quant( 2025-05-07T20:32:50.9375841Z self, 2025-05-07T20:32:50.9376042Z T: int, 2025-05-07T20:32:50.9376248Z D: int, 2025-05-07T20:32:50.9376476Z scale_ub: Optional[float], 2025-05-07T20:32:50.9376759Z contiguous: bool, 2025-05-07T20:32:50.9377013Z compiled: bool, 2025-05-07T20:32:50.9377241Z ) -> None: 2025-05-07T20:32:50.9377469Z torch.manual_seed(2025) 2025-05-07T20:32:50.9377720Z 2025-05-07T20:32:50.9377990Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.9378343Z 2025-05-07T20:32:50.9378543Z x_sign = torch.sign(x) 2025-05-07T20:32:50.9378836Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.9379154Z x = x_sign * x_clamp 2025-05-07T20:32:50.9379401Z x0 = x[:, :D] 2025-05-07T20:32:50.9379615Z x1 = x[:, D:] 2025-05-07T20:32:50.9379829Z 2025-05-07T20:32:50.9380019Z if contiguous: 2025-05-07T20:32:50.9380257Z x0 = x0.contiguous() 2025-05-07T20:32:50.9380522Z x1 = x1.contiguous() 2025-05-07T20:32:50.9380769Z 2025-05-07T20:32:50.9380972Z if scale_ub is not None: 2025-05-07T20:32:50.9381245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.9381590Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.9381909Z ) 2025-05-07T20:32:50.9382107Z else: 2025-05-07T20:32:50.9382329Z scale_ub_tensor = None 2025-05-07T20:32:50.9382593Z 2025-05-07T20:32:50.9382826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.9383152Z op = silu_mul_quant 2025-05-07T20:32:50.9383412Z if compiled: 2025-05-07T20:32:50.9383666Z op = torch.compile(op) 2025-05-07T20:32:50.9383969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9384253Z 2025-05-07T20:32:50.9384442Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.9384612Z 2025-05-07T20:32:50.9384715Z moe/activation_test.py:117: 2025-05-07T20:32:50.9385013Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9385399Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.9385677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.9386404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:50.9387093Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.9387625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.9388312Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.9389013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.9389553Z kernel = self.compile( 2025-05-07T20:32:50.9390092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.9390750Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.9391163Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.9391393Z 2025-05-07T20:32:50.9391677Z self = 2025-05-07T20:32:50.9392767Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.9394127Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ff4f380>} 2025-05-07T20:32:50.9395456Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.9396478Z context = 2025-05-07T20:32:50.9396766Z 2025-05-07T20:32:50.9396937Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.9397465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.9397946Z module_map=module_map) 2025-05-07T20:32:50.9398323Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.9398681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.9398953Z E ^ 2025-05-07T20:32:50.9399428Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.9399874Z 2025-05-07T20:32:50.9400293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.9400807Z 2025-05-07T20:32:50.9400918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.9401341Z self=, 2025-05-07T20:32:50.9401747Z T=16384, 2025-05-07T20:32:50.9401943Z D=7168, 2025-05-07T20:32:50.9402147Z scale_ub=None, 2025-05-07T20:32:50.9402373Z contiguous=True, 2025-05-07T20:32:50.9402600Z compiled=True, 2025-05-07T20:32:50.9402815Z ) 2025-05-07T20:32:51.1149412Z self = 2025-05-07T20:32:51.1150082Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.1150360Z 2025-05-07T20:32:51.1150460Z @given( 2025-05-07T20:32:51.1150707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.1151037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.1151347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.1151675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.1152006Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.1152556Z ) 2025-05-07T20:32:51.1152903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.1153442Z def test_silu_mul_quant( 2025-05-07T20:32:51.1153698Z self, 2025-05-07T20:32:51.1153887Z T: int, 2025-05-07T20:32:51.1154086Z D: int, 2025-05-07T20:32:51.1154311Z scale_ub: Optional[float], 2025-05-07T20:32:51.1154575Z contiguous: bool, 2025-05-07T20:32:51.1154816Z compiled: bool, 2025-05-07T20:32:51.1155051Z ) -> None: 2025-05-07T20:32:51.1155266Z torch.manual_seed(2025) 2025-05-07T20:32:51.1155597Z 2025-05-07T20:32:51.1155876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.1156217Z 2025-05-07T20:32:51.1156414Z x_sign = torch.sign(x) 2025-05-07T20:32:51.1156708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.1157021Z x = x_sign * x_clamp 2025-05-07T20:32:51.1157264Z x0 = x[:, :D] 2025-05-07T20:32:51.1157485Z x1 = x[:, D:] 2025-05-07T20:32:51.1157698Z 2025-05-07T20:32:51.1157883Z if contiguous: 2025-05-07T20:32:51.1158196Z x0 = x0.contiguous() 2025-05-07T20:32:51.1158461Z x1 = x1.contiguous() 2025-05-07T20:32:51.1158694Z 2025-05-07T20:32:51.1158889Z if scale_ub is not None: 2025-05-07T20:32:51.1159162Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.1159768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.1160079Z ) 2025-05-07T20:32:51.1160280Z else: 2025-05-07T20:32:51.1160491Z scale_ub_tensor = None 2025-05-07T20:32:51.1160749Z 2025-05-07T20:32:51.1160985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.1161296Z op = silu_mul_quant 2025-05-07T20:32:51.1161554Z if compiled: 2025-05-07T20:32:51.1161816Z op = torch.compile(op) 2025-05-07T20:32:51.1162116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1162385Z 2025-05-07T20:32:51.1162582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.1162749Z 2025-05-07T20:32:51.1162860Z moe/activation_test.py:117: 2025-05-07T20:32:51.1163152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1163479Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.1163761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1164313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.1164871Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.1165534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:51.1166215Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:51.1166748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:51.1167428Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:51.1168094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:51.1168638Z     kernel = self.compile(
2025-05-07T20:32:51.1176440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:51.1177108Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:51.1177519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:51.1177752Z 
2025-05-07T20:32:51.1177968Z self = 
2025-05-07T20:32:51.1179035Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:51.1180583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f44a0>}
2025-05-07T20:32:51.1181908Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:51.1182920Z context = 
2025-05-07T20:32:51.1183272Z 
2025-05-07T20:32:51.1183447Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:51.1183957Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:51.1184423Z                           module_map=module_map)
2025-05-07T20:32:51.1184790Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.1185141Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.1185407Z E   ^
2025-05-07T20:32:51.1185928Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1186377Z 
2025-05-07T20:32:51.1186799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.1187303Z 
2025-05-07T20:32:51.1187409Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.1187820Z     self=,
2025-05-07T20:32:51.1188230Z     T=4096,
2025-05-07T20:32:51.1188414Z     D=5120,
2025-05-07T20:32:51.1188615Z     scale_ub=None,
2025-05-07T20:32:51.1188837Z     contiguous=False,
2025-05-07T20:32:51.1189061Z     compiled=True,
2025-05-07T20:32:51.1189273Z )
2025-05-07T20:32:51.1189591Z self = 
2025-05-07T20:32:51.1190085Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:51.1190352Z 
2025-05-07T20:32:51.1190431Z     @given(
2025-05-07T20:32:51.1190671Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:51.1190987Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:51.1191287Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:51.1191623Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:51.1191996Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:51.1192272Z     )
2025-05-07T20:32:51.1192628Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:51.1193065Z     def test_silu_mul_quant(
2025-05-07T20:32:51.1193307Z         self,
2025-05-07T20:32:51.1193498Z         T: int,
2025-05-07T20:32:51.1193699Z         D: int,
2025-05-07T20:32:51.1193917Z         scale_ub: Optional[float],
2025-05-07T20:32:51.1194182Z         contiguous: bool,
2025-05-07T20:32:51.1194434Z         compiled: bool,
2025-05-07T20:32:51.1194657Z     ) -> None:
2025-05-07T20:32:51.1194871Z         torch.manual_seed(2025)
2025-05-07T20:32:51.1195120Z 
2025-05-07T20:32:51.1195402Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:51.1195741Z 
2025-05-07T20:32:51.1195943Z         x_sign = torch.sign(x)
2025-05-07T20:32:51.1196243Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:51.1196554Z         x = x_sign * x_clamp
2025-05-07T20:32:51.1196799Z         x0 = x[:, :D]
2025-05-07T20:32:51.1197022Z         x1 = x[:, D:]
2025-05-07T20:32:51.1197231Z 
2025-05-07T20:32:51.1197426Z         if contiguous:
2025-05-07T20:32:51.1197663Z             x0 = x0.contiguous()
2025-05-07T20:32:51.1197917Z             x1 = x1.contiguous()
2025-05-07T20:32:51.1198161Z 
2025-05-07T20:32:51.1198359Z         if scale_ub is not None:
2025-05-07T20:32:51.1198638Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:51.1199040Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:51.1199356Z             )
2025-05-07T20:32:51.1199556Z         else:
2025-05-07T20:32:51.1199810Z             scale_ub_tensor = None
2025-05-07T20:32:51.1200069Z 
2025-05-07T20:32:51.1200307Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:51.1200615Z             op = silu_mul_quant
2025-05-07T20:32:51.1200872Z             if compiled:
2025-05-07T20:32:51.1201123Z                 op = torch.compile(op)
2025-05-07T20:32:51.1201412Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.1201732Z 
2025-05-07T20:32:51.1201934Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:51.1202123Z 
2025-05-07T20:32:51.1202246Z moe/activation_test.py:117: 
2025-05-07T20:32:51.1202548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:51.1202880Z moe/activation_test.py:115: in fn
2025-05-07T20:32:51.1203163Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:51.1203714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:51.1204273Z     return fn(*args, **kwargs)
2025-05-07T20:32:51.1204973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:51.1205646Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:51.1206177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:51.1206849Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:51.1207509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:51.1208027Z     kernel = self.compile(
2025-05-07T20:32:51.1208564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:51.1209212Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:51.1209610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:51.1209835Z 
2025-05-07T20:32:51.1210042Z self = 
2025-05-07T20:32:51.1211108Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:51.1212517Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f51c0>}
2025-05-07T20:32:51.1213917Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:51.1214923Z context = 
2025-05-07T20:32:51.1215217Z 
2025-05-07T20:32:51.1215388Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:51.1215907Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:51.1216389Z                           module_map=module_map)
2025-05-07T20:32:51.1216753Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:51.1217126Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:51.1217400Z E   ^
2025-05-07T20:32:51.1217857Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.1218310Z 
2025-05-07T20:32:51.1218720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
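Every example in this sweep fails at the same point: while building IR for _fbgemm_silu_mul_quant, Triton rejects the fp8e4nv element type (the type behind torch.float8_e4m3fn), which requires compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which reports compute capability 8.6, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError says. Note that compiled=True vs. compiled=False cannot change the outcome, since the Triton kernel is compiled at launch either way. A minimal sketch of a capability guard that would skip these cases on unsupported hardware; the helper name supports_fp8e4nv is hypothetical and not part of fbgemm_gpu:

    # Hypothetical guard, a sketch only: skip fp8e4nv tests on GPUs older
    # than SM 8.9, where Triton cannot lower that dtype (as seen above).
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= (8, 9);
        # the A10G on this runner reports (8, 6), so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Applied to the test above, e.g.:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...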
2025-05-07T20:32:51.4291192Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): fails with the identical CompilationError in make_ir (ValueError: type fp8e4nv not supported in this architecture)
2025-05-07T20:32:51.4323265Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError
2025-05-07T20:32:51.5507288Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError
2025-05-07T20:32:51.5539053Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False): identical CompilationError
2025-05-07T20:32:51.5577552Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError
2025-05-07T20:32:51.8012432Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False): identical CompilationError
2025-05-07T20:32:51.8043575Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True): identical CompilationError
2025-05-07T20:32:51.8951548Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True): identical CompilationError
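The error is raised at kernel-compile time, before any data is touched, which is why the sweep over T, D, scale_ub, and contiguity only re-triggers the same failure. A standalone repro sketch, assuming only triton, torch, and the same pre-SM-8.9 CUDA device (no fbgemm_gpu involved); the kernel name _cast_fp8e4nv is hypothetical:

    # Minimal repro sketch (assumes triton + a CUDA GPU older than SM 8.9).
    # Compiling this kernel raises the same CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        y = x.to(tl.float8e4nv)  # rejected at compile time on SM 8.0/8.6
        tl.store(y_ptr + offs, y, mask=mask)


    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(1,)](x, y, 128, BLOCK=128)  # CompilationError on A10G

The remaining Hypothesis examples below hit the same compile-time path.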
2025-05-07T20:32:52.0579951Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError
2025-05-07T20:32:52.0612071Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): identical CompilationError
2025-05-07T20:32:52.2355285Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True): same test body and traceback as the first example above, again failing in make_ir:
2025-05-07T20:32:52.2388545Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:52.2388899Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:52.2389151Z E   ^
2025-05-07T20:32:52.2389614Z E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2390061Z 2025-05-07T20:32:52.2390479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2390981Z 2025-05-07T20:32:52.2391092Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2391505Z self=, 2025-05-07T20:32:52.2391919Z T=2048, 2025-05-07T20:32:52.2392153Z D=5120, 2025-05-07T20:32:52.2392363Z scale_ub=None, 2025-05-07T20:32:52.2392593Z contiguous=False, 2025-05-07T20:32:52.2392822Z compiled=True, 2025-05-07T20:32:52.2393021Z ) 2025-05-07T20:32:52.3296715Z self = 2025-05-07T20:32:52.3297241Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.3297616Z 2025-05-07T20:32:52.3297726Z @given( 2025-05-07T20:32:52.3298058Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3298369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3298676Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3299006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3299330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3299881Z ) 2025-05-07T20:32:52.3300239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3300757Z def test_silu_mul_quant( 2025-05-07T20:32:52.3301016Z self, 2025-05-07T20:32:52.3301220Z T: int, 2025-05-07T20:32:52.3301417Z D: int, 2025-05-07T20:32:52.3301644Z scale_ub: Optional[float], 2025-05-07T20:32:52.3301921Z contiguous: bool, 2025-05-07T20:32:52.3302168Z compiled: bool, 2025-05-07T20:32:52.3302402Z ) -> None: 2025-05-07T20:32:52.3302632Z torch.manual_seed(2025) 2025-05-07T20:32:52.3302958Z 2025-05-07T20:32:52.3303227Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3303576Z 2025-05-07T20:32:52.3303777Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3304068Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3304390Z x = x_sign * x_clamp 2025-05-07T20:32:52.3304644Z x0 = x[:, :D] 2025-05-07T20:32:52.3304869Z x1 = x[:, D:] 2025-05-07T20:32:52.3305084Z 2025-05-07T20:32:52.3305271Z if contiguous: 2025-05-07T20:32:52.3305507Z x0 = x0.contiguous() 2025-05-07T20:32:52.3305845Z x1 = x1.contiguous() 2025-05-07T20:32:52.3306094Z 2025-05-07T20:32:52.3306283Z if scale_ub is not None: 2025-05-07T20:32:52.3306564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3306900Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3307215Z ) 2025-05-07T20:32:52.3307409Z else: 2025-05-07T20:32:52.3307631Z scale_ub_tensor = None 2025-05-07T20:32:52.3307891Z 2025-05-07T20:32:52.3308124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3308438Z op = silu_mul_quant 2025-05-07T20:32:52.3308698Z if compiled: 2025-05-07T20:32:52.3308945Z op = torch.compile(op) 2025-05-07T20:32:52.3309256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3309542Z 2025-05-07T20:32:52.3309729Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3309904Z 2025-05-07T20:32:52.3310009Z moe/activation_test.py:117: 2025-05-07T20:32:52.3310317Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3310642Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3310933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3311494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3312058Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.3312718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3313411Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3313944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3314644Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3315310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3315841Z kernel = self.compile( 2025-05-07T20:32:52.3316390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3317046Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3317453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3317697Z 2025-05-07T20:32:52.3317914Z self = 2025-05-07T20:32:52.3319000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3320454Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f34d9e0>} 2025-05-07T20:32:52.3321794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3322809Z context = 2025-05-07T20:32:52.3323094Z 2025-05-07T20:32:52.3323309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3323832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3324301Z module_map=module_map) 2025-05-07T20:32:52.3324676Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3325042Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3325307Z E ^ 2025-05-07T20:32:52.3325822Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3326270Z 2025-05-07T20:32:52.3326693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3327200Z 2025-05-07T20:32:52.3327331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3327751Z self=, 2025-05-07T20:32:52.3328150Z T=2048, 2025-05-07T20:32:52.3328344Z D=5120, 2025-05-07T20:32:52.3328542Z scale_ub=1200.0, 2025-05-07T20:32:52.3328768Z contiguous=False, 2025-05-07T20:32:52.3328997Z compiled=True, 2025-05-07T20:32:52.3329209Z ) 2025-05-07T20:32:52.3329527Z self = 2025-05-07T20:32:52.3330028Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.3330307Z 2025-05-07T20:32:52.3330398Z @given( 2025-05-07T20:32:52.3330639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3330961Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3331276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3331615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3331948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3332243Z ) 2025-05-07T20:32:52.3332601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3333131Z def test_silu_mul_quant( 2025-05-07T20:32:52.3333386Z self, 2025-05-07T20:32:52.3333590Z T: int, 2025-05-07T20:32:52.3333791Z D: int, 2025-05-07T20:32:52.3334025Z scale_ub: Optional[float], 2025-05-07T20:32:52.3334308Z contiguous: bool, 2025-05-07T20:32:52.3334552Z compiled: bool, 2025-05-07T20:32:52.3334794Z ) -> None: 2025-05-07T20:32:52.3335017Z torch.manual_seed(2025) 2025-05-07T20:32:52.3335264Z 2025-05-07T20:32:52.3343398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3343796Z 2025-05-07T20:32:52.3343994Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3344298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3344603Z x = x_sign * x_clamp 2025-05-07T20:32:52.3344844Z x0 = x[:, :D] 2025-05-07T20:32:52.3345059Z x1 = x[:, D:] 2025-05-07T20:32:52.3345275Z 2025-05-07T20:32:52.3345466Z if contiguous: 2025-05-07T20:32:52.3345695Z x0 = x0.contiguous() 2025-05-07T20:32:52.3345956Z x1 = x1.contiguous() 2025-05-07T20:32:52.3346203Z 2025-05-07T20:32:52.3346391Z if scale_ub is not None: 2025-05-07T20:32:52.3346676Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3347012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3347400Z ) 2025-05-07T20:32:52.3347589Z else: 2025-05-07T20:32:52.3347846Z scale_ub_tensor = None 2025-05-07T20:32:52.3348103Z 2025-05-07T20:32:52.3348345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3348667Z op = silu_mul_quant 2025-05-07T20:32:52.3348918Z if compiled: 2025-05-07T20:32:52.3349177Z op = torch.compile(op) 2025-05-07T20:32:52.3349481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3349751Z 2025-05-07T20:32:52.3350000Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3350174Z 2025-05-07T20:32:52.3350276Z moe/activation_test.py:117: 2025-05-07T20:32:52.3350581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3350915Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3351197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3351764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3352325Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.3353030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3353722Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3354268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3354943Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3355622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3356156Z kernel = self.compile( 2025-05-07T20:32:52.3356695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3357348Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3357754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3357987Z 2025-05-07T20:32:52.3358206Z self = 2025-05-07T20:32:52.3359563Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3360935Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f34eb60>} 2025-05-07T20:32:52.3362268Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3363294Z context = 2025-05-07T20:32:52.3363582Z 2025-05-07T20:32:52.3363765Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3364279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3364753Z module_map=module_map) 2025-05-07T20:32:52.3365130Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3365479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3365746Z E ^ 2025-05-07T20:32:52.3366216Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3366664Z 2025-05-07T20:32:52.3367098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3367608Z 2025-05-07T20:32:52.5109977Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5110764Z self=, 2025-05-07T20:32:52.5111437Z T=4096, 2025-05-07T20:32:52.5111653Z D=5120, 2025-05-07T20:32:52.5111911Z scale_ub=1200.0, 2025-05-07T20:32:52.5112376Z contiguous=True, 2025-05-07T20:32:52.5112822Z compiled=True, 2025-05-07T20:32:52.5113235Z ) 2025-05-07T20:32:52.5113868Z self = 2025-05-07T20:32:52.5114858Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.5115537Z 2025-05-07T20:32:52.5115706Z @given( 2025-05-07T20:32:52.5116163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.5116792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.5117404Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.5118047Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.5118706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.5119280Z ) 2025-05-07T20:32:52.5119988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.5120986Z def test_silu_mul_quant( 2025-05-07T20:32:52.5121484Z self, 2025-05-07T20:32:52.5121881Z T: int, 2025-05-07T20:32:52.5122196Z D: int, 2025-05-07T20:32:52.5122456Z scale_ub: Optional[float], 2025-05-07T20:32:52.5122755Z contiguous: bool, 2025-05-07T20:32:52.5122997Z compiled: bool, 2025-05-07T20:32:52.5123231Z ) -> None: 2025-05-07T20:32:52.5123458Z torch.manual_seed(2025) 2025-05-07T20:32:52.5123699Z 2025-05-07T20:32:52.5123977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.5124327Z 2025-05-07T20:32:52.5124520Z x_sign = torch.sign(x) 2025-05-07T20:32:52.5124820Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.5125136Z x = x_sign * x_clamp 2025-05-07T20:32:52.5125389Z x0 = x[:, :D] 2025-05-07T20:32:52.5125613Z x1 = x[:, D:] 2025-05-07T20:32:52.5125831Z 2025-05-07T20:32:52.5126031Z if contiguous: 2025-05-07T20:32:52.5126264Z x0 = x0.contiguous() 2025-05-07T20:32:52.5126536Z x1 = x1.contiguous() 2025-05-07T20:32:52.5126783Z 2025-05-07T20:32:52.5126977Z if scale_ub is not None: 2025-05-07T20:32:52.5127259Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.5127604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.5127915Z ) 2025-05-07T20:32:52.5128121Z else: 2025-05-07T20:32:52.5128343Z scale_ub_tensor = None 2025-05-07T20:32:52.5128598Z 2025-05-07T20:32:52.5128844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.5129166Z op = silu_mul_quant 2025-05-07T20:32:52.5129416Z if compiled: 2025-05-07T20:32:52.5129676Z op = torch.compile(op) 2025-05-07T20:32:52.5129986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5130266Z 2025-05-07T20:32:52.5130469Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.5130646Z 2025-05-07T20:32:52.5130748Z moe/activation_test.py:117: 2025-05-07T20:32:52.5131054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5131387Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.5131677Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.5132240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.5132799Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.5133575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.5134268Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.5134812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.5135588Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.5136265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.5136804Z kernel = self.compile( 2025-05-07T20:32:52.5137350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.5138010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.5138457Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.5138685Z 2025-05-07T20:32:52.5138902Z self = 2025-05-07T20:32:52.5139981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.5141410Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4ec180>} 2025-05-07T20:32:52.5142810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.5143834Z context = 2025-05-07T20:32:52.5144127Z 2025-05-07T20:32:52.5144301Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.5144821Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.5145294Z module_map=module_map) 2025-05-07T20:32:52.5145675Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.5146033Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.5146303Z E ^ 2025-05-07T20:32:52.5146782Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.5147237Z 2025-05-07T20:32:52.5147667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.5148179Z 2025-05-07T20:32:52.5148300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.5148727Z self=, 2025-05-07T20:32:52.5149142Z T=128, 2025-05-07T20:32:52.5149334Z D=5120, 2025-05-07T20:32:52.5149541Z scale_ub=1200.0, 2025-05-07T20:32:52.5149777Z contiguous=False, 2025-05-07T20:32:52.5150013Z compiled=True, 2025-05-07T20:32:52.5150217Z ) 2025-05-07T20:32:52.7821649Z self = 2025-05-07T20:32:52.7822308Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.7822718Z 2025-05-07T20:32:52.7822850Z @given( 2025-05-07T20:32:52.7823163Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.7823575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.7823888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.7824222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.7824555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.7824857Z ) 2025-05-07T20:32:52.7825212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.7825657Z def test_silu_mul_quant( 2025-05-07T20:32:52.7825909Z self, 2025-05-07T20:32:52.7826116Z T: int, 2025-05-07T20:32:52.7826314Z D: int, 2025-05-07T20:32:52.7826543Z scale_ub: Optional[float], 2025-05-07T20:32:52.7827119Z contiguous: bool, 2025-05-07T20:32:52.7827357Z compiled: bool, 2025-05-07T20:32:52.7827599Z ) -> None: 2025-05-07T20:32:52.7827912Z torch.manual_seed(2025) 2025-05-07T20:32:52.7828163Z 2025-05-07T20:32:52.7828442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.7828793Z 2025-05-07T20:32:52.7828990Z x_sign = torch.sign(x) 2025-05-07T20:32:52.7829291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.7829610Z x = x_sign * x_clamp 2025-05-07T20:32:52.7829848Z x0 = x[:, :D] 2025-05-07T20:32:52.7830213Z x1 = x[:, D:] 2025-05-07T20:32:52.7830429Z 2025-05-07T20:32:52.7830621Z if contiguous: 2025-05-07T20:32:52.7830856Z x0 = x0.contiguous() 2025-05-07T20:32:52.7831120Z x1 = x1.contiguous() 2025-05-07T20:32:52.7831367Z 2025-05-07T20:32:52.7831564Z if scale_ub is not None: 2025-05-07T20:32:52.7831840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.7832175Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.7832493Z ) 2025-05-07T20:32:52.7832723Z else: 2025-05-07T20:32:52.7833034Z scale_ub_tensor = None 2025-05-07T20:32:52.7833293Z 2025-05-07T20:32:52.7833523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7833839Z op = silu_mul_quant 2025-05-07T20:32:52.7834093Z if compiled: 2025-05-07T20:32:52.7834337Z op = torch.compile(op) 2025-05-07T20:32:52.7834640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7834926Z 2025-05-07T20:32:52.7835143Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.7835306Z 2025-05-07T20:32:52.7835407Z moe/activation_test.py:117: 2025-05-07T20:32:52.7835704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7836033Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.7836311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7836873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.7837436Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.7838094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.7838770Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.7839303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.7839982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.7840643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.7841164Z kernel = self.compile( 2025-05-07T20:32:52.7841708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.7842364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.7842758Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7842989Z 2025-05-07T20:32:52.7843197Z self = 2025-05-07T20:32:52.7844261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.7845628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4ecea0>} 2025-05-07T20:32:52.7846947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.7848011Z context = 2025-05-07T20:32:52.7848346Z 2025-05-07T20:32:52.7848516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.7849029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.7849492Z module_map=module_map) 2025-05-07T20:32:52.7849850Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.7850248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.7850511Z E ^ 2025-05-07T20:32:52.7850965Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.7851420Z 2025-05-07T20:32:52.7851830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.7852363Z 2025-05-07T20:32:52.7852483Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.7853075Z self=, 2025-05-07T20:32:52.7853476Z T=16384, 2025-05-07T20:32:52.7853675Z D=7168, 2025-05-07T20:32:52.7853874Z scale_ub=1200.0, 2025-05-07T20:32:52.7854095Z contiguous=True, 2025-05-07T20:32:52.7854317Z compiled=True, 2025-05-07T20:32:52.7854524Z ) 2025-05-07T20:32:52.7854839Z self = 2025-05-07T20:32:52.7855331Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.7855606Z 2025-05-07T20:32:52.7855698Z @given( 2025-05-07T20:32:52.7855925Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.7856240Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.7856548Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.7856880Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.7857205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.7857494Z ) 2025-05-07T20:32:52.7857851Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.7858283Z def test_silu_mul_quant( 2025-05-07T20:32:52.7858528Z self, 2025-05-07T20:32:52.7858731Z T: int, 2025-05-07T20:32:52.7858926Z D: int, 2025-05-07T20:32:52.7859152Z scale_ub: Optional[float], 2025-05-07T20:32:52.7859727Z contiguous: bool, 2025-05-07T20:32:52.7859967Z compiled: bool, 2025-05-07T20:32:52.7860189Z ) -> None: 2025-05-07T20:32:52.7860407Z torch.manual_seed(2025) 2025-05-07T20:32:52.7860649Z 2025-05-07T20:32:52.7860916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.7861261Z 2025-05-07T20:32:52.7861447Z x_sign = torch.sign(x) 2025-05-07T20:32:52.7861742Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.7862064Z x = x_sign * x_clamp 2025-05-07T20:32:52.7862307Z x0 = x[:, :D] 2025-05-07T20:32:52.7862531Z x1 = x[:, D:] 2025-05-07T20:32:52.7862793Z 2025-05-07T20:32:52.7862992Z if contiguous: 2025-05-07T20:32:52.7863224Z x0 = x0.contiguous() 2025-05-07T20:32:52.7863487Z x1 = x1.contiguous() 2025-05-07T20:32:52.7863734Z 2025-05-07T20:32:52.7863929Z if scale_ub is not None: 2025-05-07T20:32:52.7864212Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.7864549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.7864856Z ) 2025-05-07T20:32:52.7865057Z else: 2025-05-07T20:32:52.7865274Z scale_ub_tensor = None 2025-05-07T20:32:52.7865529Z 2025-05-07T20:32:52.7865764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.7866082Z op = silu_mul_quant 2025-05-07T20:32:52.7866333Z if compiled: 2025-05-07T20:32:52.7866660Z op = torch.compile(op) 2025-05-07T20:32:52.7866956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7867299Z 2025-05-07T20:32:52.7867497Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.7867670Z 2025-05-07T20:32:52.7867772Z moe/activation_test.py:117: 2025-05-07T20:32:52.7868076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7868404Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.7868692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.7869315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.7869864Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.7870518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.7871206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.7871742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.7872487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.7873152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.7873683Z kernel = self.compile( 2025-05-07T20:32:52.7874225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.7874867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.7875267Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.7875493Z 2025-05-07T20:32:52.7875705Z self = 2025-05-07T20:32:52.7876774Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.7878128Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4ee0c0>} 2025-05-07T20:32:52.7879446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.7880460Z context = 2025-05-07T20:32:52.7880744Z 2025-05-07T20:32:52.7880914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.7881425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.7881893Z module_map=module_map) 2025-05-07T20:32:52.7882263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.7882623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.7882887Z E ^ 2025-05-07T20:32:52.7883357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.7883801Z 2025-05-07T20:32:52.7884217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.7884717Z 2025-05-07T20:32:52.9117377Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9117938Z self=, 2025-05-07T20:32:52.9118487Z T=16384, 2025-05-07T20:32:52.9118713Z D=5120, 2025-05-07T20:32:52.9118912Z scale_ub=1200.0, 2025-05-07T20:32:52.9119130Z contiguous=True, 2025-05-07T20:32:52.9119364Z compiled=False, 2025-05-07T20:32:52.9119574Z ) 2025-05-07T20:32:52.9120157Z self = 2025-05-07T20:32:52.9120742Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.9121028Z 2025-05-07T20:32:52.9121110Z @given( 2025-05-07T20:32:52.9121344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.9121657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.9121966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.9122298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.9122724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.9123010Z ) 2025-05-07T20:32:52.9123363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.9123813Z def test_silu_mul_quant( 2025-05-07T20:32:52.9124056Z self, 2025-05-07T20:32:52.9124263Z T: int, 2025-05-07T20:32:52.9124473Z D: int, 2025-05-07T20:32:52.9124697Z scale_ub: Optional[float], 2025-05-07T20:32:52.9124970Z contiguous: bool, 2025-05-07T20:32:52.9125216Z compiled: bool, 2025-05-07T20:32:52.9125513Z ) -> None: 2025-05-07T20:32:52.9125735Z torch.manual_seed(2025) 2025-05-07T20:32:52.9125989Z 2025-05-07T20:32:52.9126256Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.9126598Z 2025-05-07T20:32:52.9126790Z x_sign = torch.sign(x) 2025-05-07T20:32:52.9127073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.9127378Z x = x_sign * x_clamp 2025-05-07T20:32:52.9127623Z x0 = x[:, :D] 2025-05-07T20:32:52.9127833Z x1 = x[:, D:] 2025-05-07T20:32:52.9128048Z 2025-05-07T20:32:52.9128244Z if contiguous: 2025-05-07T20:32:52.9128473Z x0 = x0.contiguous() 2025-05-07T20:32:52.9128734Z x1 = x1.contiguous() 2025-05-07T20:32:52.9128980Z 2025-05-07T20:32:52.9129180Z if scale_ub is not None: 2025-05-07T20:32:52.9129456Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.9129802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.9130116Z ) 2025-05-07T20:32:52.9130310Z else: 2025-05-07T20:32:52.9130527Z scale_ub_tensor = None 2025-05-07T20:32:52.9130777Z 2025-05-07T20:32:52.9131003Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.9131317Z op = silu_mul_quant 2025-05-07T20:32:52.9131568Z if compiled: 2025-05-07T20:32:52.9131810Z op = torch.compile(op) 2025-05-07T20:32:52.9132111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9132392Z 2025-05-07T20:32:52.9132585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.9132758Z 2025-05-07T20:32:52.9132860Z moe/activation_test.py:117: 2025-05-07T20:32:52.9133280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9133615Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.9133901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9134589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.9135275Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.9135799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.9136472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.9144695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.9145239Z kernel = self.compile( 2025-05-07T20:32:52.9145773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.9146411Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.9146906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9147133Z 2025-05-07T20:32:52.9147386Z self = 2025-05-07T20:32:52.9148457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.9149820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f4eda80>} 2025-05-07T20:32:52.9151191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.9152201Z context = 2025-05-07T20:32:52.9152526Z 2025-05-07T20:32:52.9152706Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.9153270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.9153736Z module_map=module_map) 2025-05-07T20:32:52.9154106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.9154454Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.9154723Z E ^ 2025-05-07T20:32:52.9155187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.9155635Z 2025-05-07T20:32:52.9156048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.9156558Z 2025-05-07T20:32:52.9156662Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.9157079Z self=, 2025-05-07T20:32:52.9157478Z T=1, 2025-05-07T20:32:52.9157659Z D=7168, 2025-05-07T20:32:52.9157864Z scale_ub=1200.0, 2025-05-07T20:32:52.9158090Z contiguous=False, 2025-05-07T20:32:52.9158311Z compiled=False, 2025-05-07T20:32:52.9158522Z ) 2025-05-07T20:32:52.9158846Z self = 2025-05-07T20:32:52.9159612Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.9159886Z 2025-05-07T20:32:52.9159967Z @given( 2025-05-07T20:32:52.9160200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.9160504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.9160812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.9161138Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.9161464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.9161744Z ) 2025-05-07T20:32:52.9162089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.9162527Z def test_silu_mul_quant( 2025-05-07T20:32:52.9162793Z self, 2025-05-07T20:32:52.9163009Z T: int, 2025-05-07T20:32:52.9163207Z D: int, 2025-05-07T20:32:52.9163426Z scale_ub: Optional[float], 2025-05-07T20:32:52.9163700Z contiguous: bool, 2025-05-07T20:32:52.9163947Z compiled: bool, 2025-05-07T20:32:52.9164175Z ) -> None: 2025-05-07T20:32:52.9164401Z torch.manual_seed(2025) 2025-05-07T20:32:52.9164647Z 2025-05-07T20:32:52.9164914Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.9165263Z 2025-05-07T20:32:52.9165463Z x_sign = torch.sign(x) 2025-05-07T20:32:52.9165762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.9166064Z x = x_sign * x_clamp 2025-05-07T20:32:52.9166314Z x0 = x[:, :D] 2025-05-07T20:32:52.9166627Z x1 = x[:, D:] 2025-05-07T20:32:52.9166835Z 2025-05-07T20:32:52.9167035Z if contiguous: 2025-05-07T20:32:52.9167342Z x0 = x0.contiguous() 2025-05-07T20:32:52.9167599Z x1 = x1.contiguous() 2025-05-07T20:32:52.9167852Z 2025-05-07T20:32:52.9168046Z if scale_ub is not None: 2025-05-07T20:32:52.9168319Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.9168642Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.9168958Z ) 2025-05-07T20:32:52.9169149Z else: 2025-05-07T20:32:52.9169439Z scale_ub_tensor = None 2025-05-07T20:32:52.9169699Z 2025-05-07T20:32:52.9169928Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.9170245Z op = silu_mul_quant 2025-05-07T20:32:52.9170496Z if compiled: 2025-05-07T20:32:52.9170750Z op = torch.compile(op) 2025-05-07T20:32:52.9171041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9171315Z 2025-05-07T20:32:52.9171511Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.9171673Z 2025-05-07T20:32:52.9171843Z moe/activation_test.py:117: 2025-05-07T20:32:52.9172141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9172474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.9172750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.9173498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.9174185Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.9174716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.9175385Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.9176053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.9176591Z kernel = self.compile( 2025-05-07T20:32:52.9177128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.9177772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.9178168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.9178391Z 2025-05-07T20:32:52.9178607Z self = 2025-05-07T20:32:52.9179668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.9181022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef680e0>} 2025-05-07T20:32:52.9182355Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.9183361Z context = 2025-05-07T20:32:52.9183644Z 2025-05-07T20:32:52.9183818Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.9184330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.9184801Z module_map=module_map) 2025-05-07T20:32:52.9185169Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.9185522Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.9185790Z E ^ 2025-05-07T20:32:52.9186252Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.9186747Z 2025-05-07T20:32:52.9187207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.9187709Z 2025-05-07T20:32:53.0917260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0918307Z self=, 2025-05-07T20:32:53.0919126Z T=4096, 2025-05-07T20:32:53.0919513Z D=7168, 2025-05-07T20:32:53.0919897Z scale_ub=1200.0, 2025-05-07T20:32:53.0920356Z contiguous=False, 2025-05-07T20:32:53.0921116Z compiled=True, 2025-05-07T20:32:53.0921530Z ) 2025-05-07T20:32:53.0922163Z self = 2025-05-07T20:32:53.0922783Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:53.0923059Z 2025-05-07T20:32:53.0923149Z @given( 2025-05-07T20:32:53.0923384Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0923718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0924039Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0924456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0924793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0925097Z ) 2025-05-07T20:32:53.0925447Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0925903Z def test_silu_mul_quant( 2025-05-07T20:32:53.0926157Z self, 2025-05-07T20:32:53.0926371Z T: int, 2025-05-07T20:32:53.0926575Z D: int, 2025-05-07T20:32:53.0926810Z scale_ub: Optional[float], 2025-05-07T20:32:53.0927097Z contiguous: bool, 2025-05-07T20:32:53.0927350Z compiled: bool, 2025-05-07T20:32:53.0927588Z ) -> None: 2025-05-07T20:32:53.0927816Z torch.manual_seed(2025) 2025-05-07T20:32:53.0928060Z 2025-05-07T20:32:53.0928344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0928697Z 2025-05-07T20:32:53.0928897Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0929206Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0929531Z x = x_sign * x_clamp 2025-05-07T20:32:53.0929775Z x0 = x[:, :D] 2025-05-07T20:32:53.0930004Z x1 = x[:, D:] 2025-05-07T20:32:53.0930226Z 2025-05-07T20:32:53.0930420Z if contiguous: 2025-05-07T20:32:53.0930663Z x0 = x0.contiguous() 2025-05-07T20:32:53.0930932Z x1 = x1.contiguous() 2025-05-07T20:32:53.0931188Z 2025-05-07T20:32:53.0931383Z if scale_ub is not None: 2025-05-07T20:32:53.0931669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0932016Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0932328Z ) 2025-05-07T20:32:53.0932554Z else: 2025-05-07T20:32:53.0932800Z scale_ub_tensor = None 2025-05-07T20:32:53.0933151Z 2025-05-07T20:32:53.0933390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0933708Z op = silu_mul_quant 2025-05-07T20:32:53.0933960Z if compiled: 2025-05-07T20:32:53.0934218Z op = torch.compile(op) 2025-05-07T20:32:53.0934529Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0934813Z 2025-05-07T20:32:53.0935006Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0935179Z 2025-05-07T20:32:53.0935281Z moe/activation_test.py:117: 2025-05-07T20:32:53.0935585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0935925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0936210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0936772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.0937338Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.0938086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0938847Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0939393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0940073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0940731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0941347Z kernel = self.compile( 2025-05-07T20:32:53.0941892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0942538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0942938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0943172Z 2025-05-07T20:32:53.0943380Z self = 2025-05-07T20:32:53.0944496Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0945865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef69300>} 2025-05-07T20:32:53.0947184Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0948200Z context = 2025-05-07T20:32:53.0948492Z 2025-05-07T20:32:53.0948660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0949186Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0949652Z module_map=module_map) 2025-05-07T20:32:53.0950022Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0950381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0950638Z E ^ 2025-05-07T20:32:53.0951106Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0951562Z 2025-05-07T20:32:53.0951988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.0952522Z 2025-05-07T20:32:53.0952659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0953075Z self=, 2025-05-07T20:32:53.0953488Z T=128, 2025-05-07T20:32:53.0953690Z D=7168, 2025-05-07T20:32:53.0953883Z scale_ub=1200.0, 2025-05-07T20:32:53.0954127Z contiguous=False, 2025-05-07T20:32:53.0954364Z compiled=True, 2025-05-07T20:32:53.0954568Z ) 2025-05-07T20:32:53.1866088Z self = 2025-05-07T20:32:53.1866603Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:53.1866965Z 2025-05-07T20:32:53.1867091Z @given( 2025-05-07T20:32:53.1867406Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1867834Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1868274Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1868703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1869136Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1869507Z ) 2025-05-07T20:32:53.1869858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1870562Z def test_silu_mul_quant( 2025-05-07T20:32:53.1870808Z self, 2025-05-07T20:32:53.1871003Z T: int, 2025-05-07T20:32:53.1871300Z D: int, 2025-05-07T20:32:53.1871526Z scale_ub: Optional[float], 2025-05-07T20:32:53.1871807Z contiguous: bool, 2025-05-07T20:32:53.1872052Z compiled: bool, 2025-05-07T20:32:53.1872284Z ) -> None: 2025-05-07T20:32:53.1872504Z torch.manual_seed(2025) 2025-05-07T20:32:53.1872788Z 2025-05-07T20:32:53.1873056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1873488Z 2025-05-07T20:32:53.1873692Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1873989Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1874293Z x = x_sign * x_clamp 2025-05-07T20:32:53.1874537Z x0 = x[:, :D] 2025-05-07T20:32:53.1874765Z x1 = x[:, D:] 2025-05-07T20:32:53.1874971Z 2025-05-07T20:32:53.1875167Z if contiguous: 2025-05-07T20:32:53.1875407Z x0 = x0.contiguous() 2025-05-07T20:32:53.1875667Z x1 = x1.contiguous() 2025-05-07T20:32:53.1875918Z 2025-05-07T20:32:53.1876197Z if scale_ub is not None: 2025-05-07T20:32:53.1876476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1876826Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1877144Z ) 2025-05-07T20:32:53.1877339Z else: 2025-05-07T20:32:53.1877560Z scale_ub_tensor = None 2025-05-07T20:32:53.1877824Z 2025-05-07T20:32:53.1878061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1878387Z op = silu_mul_quant 2025-05-07T20:32:53.1878652Z if compiled: 2025-05-07T20:32:53.1878916Z op = torch.compile(op) 2025-05-07T20:32:53.1879217Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1879506Z 2025-05-07T20:32:53.1879712Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1879884Z 2025-05-07T20:32:53.1879986Z moe/activation_test.py:117: 2025-05-07T20:32:53.1880295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1880645Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1880927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1881490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.1882052Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.1882722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.1883396Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.1883939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1884618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1885275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1885812Z kernel = self.compile( 2025-05-07T20:32:53.1886368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1887025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1887424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1887659Z 2025-05-07T20:32:53.1887871Z self = 2025-05-07T20:32:53.1888947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1890314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef6a020>} 2025-05-07T20:32:53.1891724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1892808Z context = 2025-05-07T20:32:53.1893203Z 2025-05-07T20:32:53.1893371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1893933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1894394Z module_map=module_map) 2025-05-07T20:32:53.1894763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1895118Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.1895382Z E ^ 2025-05-07T20:32:53.1895843Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1896294Z 2025-05-07T20:32:53.1896750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1897255Z 2025-05-07T20:32:53.1897366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.1897779Z self=, 2025-05-07T20:32:53.1898179Z T=2048, 2025-05-07T20:32:53.1898373Z D=7168, 2025-05-07T20:32:53.1898571Z scale_ub=None, 2025-05-07T20:32:53.1898783Z contiguous=True, 2025-05-07T20:32:53.1899014Z compiled=True, 2025-05-07T20:32:53.1899225Z ) 2025-05-07T20:32:53.1899540Z self = 2025-05-07T20:32:53.1900033Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.1900300Z 2025-05-07T20:32:53.1900395Z @given( 2025-05-07T20:32:53.1900629Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.1900948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.1901277Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.1901615Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.1901947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.1902246Z ) 2025-05-07T20:32:53.1902617Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.1903090Z def test_silu_mul_quant( 2025-05-07T20:32:53.1903342Z self, 2025-05-07T20:32:53.1903552Z T: int, 2025-05-07T20:32:53.1903754Z D: int, 2025-05-07T20:32:53.1903992Z scale_ub: Optional[float], 2025-05-07T20:32:53.1904270Z contiguous: bool, 2025-05-07T20:32:53.1904514Z compiled: bool, 2025-05-07T20:32:53.1904751Z ) -> None: 2025-05-07T20:32:53.1904971Z torch.manual_seed(2025) 2025-05-07T20:32:53.1905221Z 2025-05-07T20:32:53.1905505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.1905858Z 2025-05-07T20:32:53.1906062Z x_sign = torch.sign(x) 2025-05-07T20:32:53.1906360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.1906679Z x = x_sign * x_clamp 2025-05-07T20:32:53.1906932Z x0 = x[:, :D] 2025-05-07T20:32:53.1907155Z x1 = x[:, D:] 2025-05-07T20:32:53.1907376Z 2025-05-07T20:32:53.1907575Z if contiguous: 2025-05-07T20:32:53.1907805Z x0 = x0.contiguous() 2025-05-07T20:32:53.1908076Z x1 = x1.contiguous() 2025-05-07T20:32:53.1908322Z 2025-05-07T20:32:53.1908516Z if scale_ub is not None: 2025-05-07T20:32:53.1908798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.1909142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.1909452Z ) 2025-05-07T20:32:53.1909653Z else: 2025-05-07T20:32:53.1909931Z scale_ub_tensor = None 2025-05-07T20:32:53.1910179Z 2025-05-07T20:32:53.1910458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.1910780Z op = silu_mul_quant 2025-05-07T20:32:53.1911030Z if compiled: 2025-05-07T20:32:53.1911288Z op = torch.compile(op) 2025-05-07T20:32:53.1911595Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1911875Z 2025-05-07T20:32:53.1912071Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.1912246Z 2025-05-07T20:32:53.1912358Z moe/activation_test.py:117: 2025-05-07T20:32:53.1912741Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1913069Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.1913357Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.1913917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.1914478Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.1915188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.1915883Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.1916420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.1917102Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.1917775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.1918311Z kernel = self.compile( 2025-05-07T20:32:53.1918863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.1919525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.1919934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.1920168Z 2025-05-07T20:32:53.1920389Z self = 2025-05-07T20:32:53.1921455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.1922861Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ef6b240>} 2025-05-07T20:32:53.1924212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.1925254Z context = 2025-05-07T20:32:53.1925543Z 2025-05-07T20:32:53.1925730Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.1926263Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.1926754Z module_map=module_map) 2025-05-07T20:32:53.1927142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.1927497Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.1927769Z E ^ 2025-05-07T20:32:53.1928249Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.1928710Z 2025-05-07T20:32:53.1929140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.1929642Z 2025-05-07T20:32:53.2536239Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2536890Z self=, 2025-05-07T20:32:53.2537665Z T=16384, 2025-05-07T20:32:53.2537922Z D=5120, 2025-05-07T20:32:53.2538112Z scale_ub=None, 2025-05-07T20:32:53.2538465Z contiguous=False, 2025-05-07T20:32:53.2538700Z compiled=False, 2025-05-07T20:32:53.2538907Z ) 2025-05-07T20:32:53.2539246Z self = 2025-05-07T20:32:53.2539747Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.2540032Z 2025-05-07T20:32:53.2540115Z @given( 2025-05-07T20:32:53.2540344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2540725Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2541030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2541366Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2541690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2541977Z ) 2025-05-07T20:32:53.2542346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2542824Z def test_silu_mul_quant( 2025-05-07T20:32:53.2543068Z self, 2025-05-07T20:32:53.2543337Z T: int, 2025-05-07T20:32:53.2550829Z D: int, 2025-05-07T20:32:53.2551114Z scale_ub: Optional[float], 2025-05-07T20:32:53.2551399Z contiguous: bool, 2025-05-07T20:32:53.2551646Z compiled: bool, 2025-05-07T20:32:53.2551870Z ) -> None: 2025-05-07T20:32:53.2552095Z torch.manual_seed(2025) 2025-05-07T20:32:53.2552352Z 2025-05-07T20:32:53.2552636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2552982Z 2025-05-07T20:32:53.2553187Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2553484Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2555499Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
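The CompilationError repeated above is an architecture limitation rather than a kernel bug: Triton's fp8e4nv type (FP8 E4M3) is only emitted for NVIDIA GPUs of compute capability 8.9 or newer (Ada, Hopper), while this g5.4xlarge runner's A10G is sm_86, where Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal sketch of a capability guard a test like this could use to skip cleanly on pre-sm_89 GPUs (the helper and the skip decorator are illustrative assumptions, not FBGEMM's actual code):

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv (E4M3) codegen requires sm_89+ (Ada) or sm_90 (Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipIf(not supports_fp8_e4m3(), "FP8 E4M3 not supported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...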
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2557347Z 2025-05-07T20:32:53.2557469Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.2557694Z 2025-05-07T20:32:53.2557798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2558216Z self=, 2025-05-07T20:32:53.2558613Z T=4096, 2025-05-07T20:32:53.2558814Z D=7168, 2025-05-07T20:32:53.2559018Z scale_ub=1200.0, 2025-05-07T20:32:53.2559523Z contiguous=True, 2025-05-07T20:32:53.2559758Z compiled=True, 2025-05-07T20:32:53.2559967Z ) 2025-05-07T20:32:53.2560288Z self = 2025-05-07T20:32:53.2560790Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.2561062Z 2025-05-07T20:32:53.2561149Z @given( 2025-05-07T20:32:53.2561381Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2561700Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2562014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2562348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2562673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2562964Z ) 2025-05-07T20:32:53.2563316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2563757Z def test_silu_mul_quant( 2025-05-07T20:32:53.2564008Z self, 2025-05-07T20:32:53.2564216Z T: int, 2025-05-07T20:32:53.2564418Z D: int, 2025-05-07T20:32:53.2564769Z scale_ub: Optional[float], 2025-05-07T20:32:53.2565044Z contiguous: bool, 2025-05-07T20:32:53.2565282Z compiled: bool, 2025-05-07T20:32:53.2565591Z ) -> None: 2025-05-07T20:32:53.2565823Z torch.manual_seed(2025) 2025-05-07T20:32:53.2566068Z 2025-05-07T20:32:53.2566352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2566700Z 2025-05-07T20:32:53.2566907Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2567198Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2569206Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2571139Z 2025-05-07T20:32:53.2571325Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.2571537Z 2025-05-07T20:32:53.2571650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2572063Z self=, 2025-05-07T20:32:53.2572516Z T=16384, 2025-05-07T20:32:53.2572734Z D=7168, 2025-05-07T20:32:53.2572935Z scale_ub=None, 2025-05-07T20:32:53.2573237Z contiguous=False, 2025-05-07T20:32:53.2573479Z compiled=False, 2025-05-07T20:32:53.2573690Z ) 2025-05-07T20:32:53.2574007Z self = 2025-05-07T20:32:53.2574508Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.2574788Z 2025-05-07T20:32:53.2574890Z @given( 2025-05-07T20:32:53.2575124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2575450Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2575774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2576108Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2576455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2576760Z ) 2025-05-07T20:32:53.2577125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2577568Z def test_silu_mul_quant( 2025-05-07T20:32:53.2577825Z self, 2025-05-07T20:32:53.2578040Z T: int, 2025-05-07T20:32:53.2578243Z D: int, 2025-05-07T20:32:53.2578475Z scale_ub: Optional[float], 2025-05-07T20:32:53.2578761Z contiguous: bool, 2025-05-07T20:32:53.2579007Z compiled: bool, 2025-05-07T20:32:53.2579246Z ) -> None: 2025-05-07T20:32:53.2579469Z torch.manual_seed(2025) 2025-05-07T20:32:53.2579710Z 2025-05-07T20:32:53.2579995Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2582037Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2583894Z 2025-05-07T20:32:53.2584015Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.2584224Z 2025-05-07T20:32:53.2584338Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2584743Z self=, 2025-05-07T20:32:53.2585202Z T=2048, 2025-05-07T20:32:53.2585402Z D=7168, 2025-05-07T20:32:53.2585590Z scale_ub=1200.0, 2025-05-07T20:32:53.2585824Z contiguous=True, 2025-05-07T20:32:53.2586098Z compiled=True, 2025-05-07T20:32:53.2586302Z ) 2025-05-07T20:32:53.2586629Z self = 2025-05-07T20:32:53.2587121Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.2587393Z 2025-05-07T20:32:53.2587480Z @given( 2025-05-07T20:32:53.2587710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.2588068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.2588381Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.2588709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.2589048Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.2589343Z ) 2025-05-07T20:32:53.2589686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.2590138Z def test_silu_mul_quant( 2025-05-07T20:32:53.2590385Z self, 2025-05-07T20:32:53.2590582Z T: int, 2025-05-07T20:32:53.2590822Z D: int, 2025-05-07T20:32:53.2591044Z scale_ub: Optional[float], 2025-05-07T20:32:53.2591315Z contiguous: bool, 2025-05-07T20:32:53.2591562Z compiled: bool, 2025-05-07T20:32:53.2591794Z ) -> None: 2025-05-07T20:32:53.2592015Z torch.manual_seed(2025) 2025-05-07T20:32:53.2592257Z 2025-05-07T20:32:53.2592534Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.2592872Z 2025-05-07T20:32:53.2593067Z x_sign = torch.sign(x) 2025-05-07T20:32:53.2593364Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.2595351Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.2597184Z 2025-05-07T20:32:53.2597308Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.2597519Z 2025-05-07T20:32:53.2597623Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.2598041Z self=, 2025-05-07T20:32:53.2598442Z T=2048, 2025-05-07T20:32:53.2598634Z D=7168, 2025-05-07T20:32:53.2598820Z scale_ub=None, 2025-05-07T20:32:53.2599042Z contiguous=True, 2025-05-07T20:32:53.2599277Z compiled=False, 2025-05-07T20:32:53.2599475Z ) 2025-05-07T20:32:53.3733234Z self = 2025-05-07T20:32:53.3733989Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.3734370Z 2025-05-07T20:32:53.3734504Z @given( 2025-05-07T20:32:53.3734734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.3735052Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.3735361Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.3735706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.3736033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.3736328Z ) 2025-05-07T20:32:53.3736674Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.3737109Z def test_silu_mul_quant( 2025-05-07T20:32:53.3737356Z self, 2025-05-07T20:32:53.3737559Z T: int, 2025-05-07T20:32:53.3737756Z D: int, 2025-05-07T20:32:53.3737979Z scale_ub: Optional[float], 2025-05-07T20:32:53.3738499Z contiguous: bool, 2025-05-07T20:32:53.3738736Z compiled: bool, 2025-05-07T20:32:53.3738966Z ) -> None: 2025-05-07T20:32:53.3739284Z torch.manual_seed(2025) 2025-05-07T20:32:53.3739529Z 2025-05-07T20:32:53.3739814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.3740160Z 2025-05-07T20:32:53.3740355Z > x_sign = torch.sign(x) 2025-05-07T20:32:53.3742280Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
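The OutOfMemoryError examples interleaved here are a knock-on effect of the compilation failures: each failed example leaves its bfloat16 input behind (a [16384, 14336] bfloat16 tensor is exactly the 448.00 MiB the allocator reports), so the 22.07 GiB A10G fills up and later allocations of as little as 40 MiB fail. The error text suggests the allocator knob itself; a sketch of applying it, with the caveat that it only reduces fragmentation and does not reclaim memory held by live tensors:

    import os

    # Must be set before the process makes its first CUDA allocation,
    # e.g. in the CI job's environment, not inside the test body.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"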
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.3744199Z 2025-05-07T20:32:53.3744317Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:53.3744535Z 2025-05-07T20:32:53.3744709Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.3745129Z self=, 2025-05-07T20:32:53.3745522Z T=1, 2025-05-07T20:32:53.3745711Z D=7168, 2025-05-07T20:32:53.3745911Z scale_ub=1200.0, 2025-05-07T20:32:53.3746132Z contiguous=True, 2025-05-07T20:32:53.3746359Z compiled=False, 2025-05-07T20:32:53.3746572Z ) 2025-05-07T20:32:53.3746896Z self = 2025-05-07T20:32:53.3747375Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.3747648Z 2025-05-07T20:32:53.3747730Z @given( 2025-05-07T20:32:53.3747967Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.3748278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.3748581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.3748908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.3749245Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.3749523Z ) 2025-05-07T20:32:53.3749868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.3750305Z def test_silu_mul_quant( 2025-05-07T20:32:53.3750539Z self, 2025-05-07T20:32:53.3750737Z T: int, 2025-05-07T20:32:53.3750944Z D: int, 2025-05-07T20:32:53.3751164Z scale_ub: Optional[float], 2025-05-07T20:32:53.3751436Z contiguous: bool, 2025-05-07T20:32:53.3751681Z compiled: bool, 2025-05-07T20:32:53.3751907Z ) -> None: 2025-05-07T20:32:53.3752126Z torch.manual_seed(2025) 2025-05-07T20:32:53.3752373Z 2025-05-07T20:32:53.3752646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.3752980Z 2025-05-07T20:32:53.3753178Z x_sign = torch.sign(x) 2025-05-07T20:32:53.3753469Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.3753776Z x = x_sign * x_clamp 2025-05-07T20:32:53.3754024Z x0 = x[:, :D] 2025-05-07T20:32:53.3754240Z x1 = x[:, D:] 2025-05-07T20:32:53.3754445Z 2025-05-07T20:32:53.3754633Z if contiguous: 2025-05-07T20:32:53.3754866Z x0 = x0.contiguous() 2025-05-07T20:32:53.3755115Z x1 = x1.contiguous() 2025-05-07T20:32:53.3755354Z 2025-05-07T20:32:53.3755550Z if scale_ub is not None: 2025-05-07T20:32:53.3755822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.3756153Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.3756476Z ) 2025-05-07T20:32:53.3756666Z else: 2025-05-07T20:32:53.3756884Z scale_ub_tensor = None 2025-05-07T20:32:53.3757137Z 2025-05-07T20:32:53.3757372Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.3757746Z op = silu_mul_quant 2025-05-07T20:32:53.3757998Z if compiled: 2025-05-07T20:32:53.3758287Z op = torch.compile(op) 2025-05-07T20:32:53.3758586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.3758863Z 2025-05-07T20:32:53.3759061Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.3759501Z 2025-05-07T20:32:53.3759606Z moe/activation_test.py:117: 2025-05-07T20:32:53.3759905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.3760241Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.3760613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.3761301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.3761986Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.3762515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.3763196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.3763920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.3764447Z kernel = self.compile( 2025-05-07T20:32:53.3764978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.3765626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.3766022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.3766245Z 2025-05-07T20:32:53.3766457Z self = 2025-05-07T20:32:53.3767520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.3768883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca2520>} 2025-05-07T20:32:53.3770204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.3771213Z context = 2025-05-07T20:32:53.3771499Z 2025-05-07T20:32:53.3771669Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.3772181Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.3772684Z module_map=module_map) 2025-05-07T20:32:53.3773145Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.3773493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.3773758Z E ^ 2025-05-07T20:32:53.3774224Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.3774666Z 2025-05-07T20:32:53.3775085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.3775587Z 2025-05-07T20:32:53.3775695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.3776103Z self=, 2025-05-07T20:32:53.3776507Z T=128, 2025-05-07T20:32:53.3776690Z D=5120, 2025-05-07T20:32:53.3776885Z scale_ub=None, 2025-05-07T20:32:53.3777102Z contiguous=True, 2025-05-07T20:32:53.3777324Z compiled=False, 2025-05-07T20:32:53.3777532Z ) 2025-05-07T20:32:53.4459924Z self = 2025-05-07T20:32:53.4460940Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.4461290Z 2025-05-07T20:32:53.4461480Z @given( 2025-05-07T20:32:53.4461729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.4462038Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.4462350Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.4462688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.4463011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.4463384Z ) 2025-05-07T20:32:53.4463739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.4464174Z def test_silu_mul_quant( 2025-05-07T20:32:53.4464419Z self, 2025-05-07T20:32:53.4464621Z T: int, 2025-05-07T20:32:53.4464818Z D: int, 2025-05-07T20:32:53.4465042Z scale_ub: Optional[float], 2025-05-07T20:32:53.4465315Z contiguous: bool, 2025-05-07T20:32:53.4465556Z compiled: bool, 2025-05-07T20:32:53.4465784Z ) -> None: 2025-05-07T20:32:53.4466005Z torch.manual_seed(2025) 2025-05-07T20:32:53.4466322Z 2025-05-07T20:32:53.4466595Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.4466938Z 2025-05-07T20:32:53.4467139Z x_sign = torch.sign(x) 2025-05-07T20:32:53.4467426Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.4467741Z x = x_sign * x_clamp 2025-05-07T20:32:53.4467987Z x0 = x[:, :D] 2025-05-07T20:32:53.4468204Z x1 = x[:, D:] 2025-05-07T20:32:53.4468420Z 2025-05-07T20:32:53.4468607Z if contiguous: 2025-05-07T20:32:53.4468834Z x0 = x0.contiguous() 2025-05-07T20:32:53.4469096Z x1 = x1.contiguous() 2025-05-07T20:32:53.4469345Z 2025-05-07T20:32:53.4469534Z if scale_ub is not None: 2025-05-07T20:32:53.4469809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.4470148Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.4470456Z ) 2025-05-07T20:32:53.4470659Z else: 2025-05-07T20:32:53.4470875Z scale_ub_tensor = None 2025-05-07T20:32:53.4471120Z 2025-05-07T20:32:53.4471354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.4471671Z op = silu_mul_quant 2025-05-07T20:32:53.4471926Z if compiled: 2025-05-07T20:32:53.4472172Z op = torch.compile(op) 2025-05-07T20:32:53.4472475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4472799Z 2025-05-07T20:32:53.4472988Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.4473162Z 2025-05-07T20:32:53.4473263Z moe/activation_test.py:117: 2025-05-07T20:32:53.4473564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4473898Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.4474187Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4474881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.4475574Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.4476114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.4476794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.4477460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.4477988Z kernel = self.compile( 2025-05-07T20:32:53.4478531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.4479182Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.4479588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4479863Z 2025-05-07T20:32:53.4480072Z self = 2025-05-07T20:32:53.4481185Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.4482567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca3420>} 2025-05-07T20:32:53.4483977Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.4484990Z context = 2025-05-07T20:32:53.4485275Z 2025-05-07T20:32:53.4485446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.4486020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.4486490Z module_map=module_map) 2025-05-07T20:32:53.4486851Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.4487215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.4487478Z E ^ 2025-05-07T20:32:53.4487943Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.4488388Z 2025-05-07T20:32:53.4488800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.4489309Z 2025-05-07T20:32:53.4489413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.4489829Z self=, 2025-05-07T20:32:53.4490228Z T=128, 2025-05-07T20:32:53.4490422Z D=7168, 2025-05-07T20:32:53.4490623Z scale_ub=None, 2025-05-07T20:32:53.4490841Z contiguous=True, 2025-05-07T20:32:53.4491068Z compiled=False, 2025-05-07T20:32:53.4491285Z ) 2025-05-07T20:32:53.4491611Z self = 2025-05-07T20:32:53.4492098Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.4492370Z 2025-05-07T20:32:53.4492448Z @given( 2025-05-07T20:32:53.4492682Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.4493151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.4493459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.4493786Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.4494109Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.4494397Z ) 2025-05-07T20:32:53.4494746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.4495187Z def test_silu_mul_quant( 2025-05-07T20:32:53.4495422Z self, 2025-05-07T20:32:53.4495621Z T: int, 2025-05-07T20:32:53.4495825Z D: int, 2025-05-07T20:32:53.4496046Z scale_ub: Optional[float], 2025-05-07T20:32:53.4496322Z contiguous: bool, 2025-05-07T20:32:53.4496565Z compiled: bool, 2025-05-07T20:32:53.4496788Z ) -> None: 2025-05-07T20:32:53.4497006Z torch.manual_seed(2025) 2025-05-07T20:32:53.4497252Z 2025-05-07T20:32:53.4497525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.4497872Z 2025-05-07T20:32:53.4498074Z x_sign = torch.sign(x) 2025-05-07T20:32:53.4498365Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.4498682Z x = x_sign * x_clamp 2025-05-07T20:32:53.4498925Z x0 = x[:, :D] 2025-05-07T20:32:53.4499139Z x1 = x[:, D:] 2025-05-07T20:32:53.4499407Z 2025-05-07T20:32:53.4499602Z if contiguous: 2025-05-07T20:32:53.4499842Z x0 = x0.contiguous() 2025-05-07T20:32:53.4500140Z x1 = x1.contiguous() 2025-05-07T20:32:53.4500393Z 2025-05-07T20:32:53.4500591Z if scale_ub is not None: 2025-05-07T20:32:53.4500863Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.4501205Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.4501513Z ) 2025-05-07T20:32:53.4501701Z else: 2025-05-07T20:32:53.4501913Z scale_ub_tensor = None 2025-05-07T20:32:53.4502215Z 2025-05-07T20:32:53.4502451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.4502813Z op = silu_mul_quant 2025-05-07T20:32:53.4503071Z if compiled: 2025-05-07T20:32:53.4503317Z op = torch.compile(op) 2025-05-07T20:32:53.4503614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4503892Z 2025-05-07T20:32:53.4504086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.4504257Z 2025-05-07T20:32:53.4504357Z moe/activation_test.py:117: 2025-05-07T20:32:53.4504745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4505078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.4505352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.4506036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.4506715Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.4507245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.4507920Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.4508577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.4509108Z kernel = self.compile( 2025-05-07T20:32:53.4509642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.4510294Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.4510689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.4510916Z 2025-05-07T20:32:53.4511129Z self = 2025-05-07T20:32:53.4512194Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.4513602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea944a0>} 2025-05-07T20:32:53.4522058Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.4523143Z context = 2025-05-07T20:32:53.4523435Z 2025-05-07T20:32:53.4523606Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.4524124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.4524592Z module_map=module_map) 2025-05-07T20:32:53.4524964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.4525315Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.4525578Z E ^ 2025-05-07T20:32:53.4526042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.4526486Z 2025-05-07T20:32:53.4526989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.4527502Z 2025-05-07T20:32:53.4527650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.4528069Z self=, 2025-05-07T20:32:53.4528471Z T=2048, 2025-05-07T20:32:53.4528658Z D=7168, 2025-05-07T20:32:53.4528856Z scale_ub=1200.0, 2025-05-07T20:32:53.4529081Z contiguous=True, 2025-05-07T20:32:53.4529303Z compiled=False, 2025-05-07T20:32:53.4529522Z ) 2025-05-07T20:32:53.5340759Z self = 2025-05-07T20:32:53.5341546Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.5341916Z 2025-05-07T20:32:53.5342035Z @given( 2025-05-07T20:32:53.5342300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5342629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5342956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5343282Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5343877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5344171Z ) 2025-05-07T20:32:53.5344523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5344959Z def test_silu_mul_quant( 2025-05-07T20:32:53.5345206Z self, 2025-05-07T20:32:53.5345407Z T: int, 2025-05-07T20:32:53.5345603Z D: int, 2025-05-07T20:32:53.5345831Z scale_ub: Optional[float], 2025-05-07T20:32:53.5346108Z contiguous: bool, 2025-05-07T20:32:53.5346343Z compiled: bool, 2025-05-07T20:32:53.5346571Z ) -> None: 2025-05-07T20:32:53.5346786Z torch.manual_seed(2025) 2025-05-07T20:32:53.5347024Z 2025-05-07T20:32:53.5347303Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5349357Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.5351203Z 2025-05-07T20:32:53.5351323Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.5351535Z 2025-05-07T20:32:53.5351647Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5352060Z self=, 2025-05-07T20:32:53.5352466Z T=1, 2025-05-07T20:32:53.5352684Z D=5120, 2025-05-07T20:32:53.5352904Z scale_ub=1200.0, 2025-05-07T20:32:53.5353139Z contiguous=True, 2025-05-07T20:32:53.5353372Z compiled=False, 2025-05-07T20:32:53.5353576Z ) 2025-05-07T20:32:53.5353902Z self = 2025-05-07T20:32:53.5354393Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.5354660Z 2025-05-07T20:32:53.5354744Z @given( 2025-05-07T20:32:53.5354972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5355289Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5355599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5355935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5356273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5356568Z ) 2025-05-07T20:32:53.5356915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5357358Z def test_silu_mul_quant( 2025-05-07T20:32:53.5357692Z self, 2025-05-07T20:32:53.5357888Z T: int, 2025-05-07T20:32:53.5358099Z D: int, 2025-05-07T20:32:53.5358404Z scale_ub: Optional[float], 2025-05-07T20:32:53.5358681Z contiguous: bool, 2025-05-07T20:32:53.5358933Z compiled: bool, 2025-05-07T20:32:53.5359162Z ) -> None: 2025-05-07T20:32:53.5359664Z torch.manual_seed(2025) 2025-05-07T20:32:53.5359904Z 2025-05-07T20:32:53.5360177Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5360519Z 2025-05-07T20:32:53.5360711Z x_sign = torch.sign(x) 2025-05-07T20:32:53.5361092Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.5361405Z x = x_sign * x_clamp 2025-05-07T20:32:53.5361642Z x0 = x[:, :D] 2025-05-07T20:32:53.5361873Z x1 = x[:, D:] 2025-05-07T20:32:53.5362088Z 2025-05-07T20:32:53.5362271Z if contiguous: 2025-05-07T20:32:53.5362509Z x0 = x0.contiguous() 2025-05-07T20:32:53.5362786Z x1 = x1.contiguous() 2025-05-07T20:32:53.5363069Z 2025-05-07T20:32:53.5363265Z if scale_ub is not None: 2025-05-07T20:32:53.5363611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.5363943Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.5364257Z ) 2025-05-07T20:32:53.5364464Z else: 2025-05-07T20:32:53.5364681Z scale_ub_tensor = None 2025-05-07T20:32:53.5364945Z 2025-05-07T20:32:53.5365187Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.5365509Z op = silu_mul_quant 2025-05-07T20:32:53.5365753Z if compiled: 2025-05-07T20:32:53.5366010Z op = torch.compile(op) 2025-05-07T20:32:53.5366316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5366597Z 2025-05-07T20:32:53.5366801Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.5366967Z 2025-05-07T20:32:53.5367072Z moe/activation_test.py:117: 2025-05-07T20:32:53.5367371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5367709Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.5368003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.5368695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.5369379Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.5369918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.5370610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.5371264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.5371812Z kernel = self.compile( 2025-05-07T20:32:53.5372370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.5373107Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5373508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.5373750Z 2025-05-07T20:32:53.5373961Z self = 2025-05-07T20:32:53.5375047Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.5376417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea95a80>} 2025-05-07T20:32:53.5377752Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.5378903Z context = 2025-05-07T20:32:53.5379198Z 2025-05-07T20:32:53.5379370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.5379896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5380362Z module_map=module_map) 2025-05-07T20:32:53.5380742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5381149Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.5381430Z E ^ 2025-05-07T20:32:53.5381887Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5382347Z 2025-05-07T20:32:53.5382811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.5383322Z 2025-05-07T20:32:53.5383436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5383885Z self=, 2025-05-07T20:32:53.5384291Z T=2048, 2025-05-07T20:32:53.5384488Z D=5120, 2025-05-07T20:32:53.5384684Z scale_ub=None, 2025-05-07T20:32:53.5384898Z contiguous=True, 2025-05-07T20:32:53.5385131Z compiled=False, 2025-05-07T20:32:53.5385340Z ) 2025-05-07T20:32:53.5385661Z self = 2025-05-07T20:32:53.5386154Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.5386427Z 2025-05-07T20:32:53.5386515Z @given( 2025-05-07T20:32:53.5386747Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.5387075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.5387393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.5387728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.5388077Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.5388374Z ) 2025-05-07T20:32:53.5388741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.5389186Z def test_silu_mul_quant( 2025-05-07T20:32:53.5389435Z self, 2025-05-07T20:32:53.5389647Z T: int, 2025-05-07T20:32:53.5389845Z D: int, 2025-05-07T20:32:53.5390076Z scale_ub: Optional[float], 2025-05-07T20:32:53.5390359Z contiguous: bool, 2025-05-07T20:32:53.5390601Z compiled: bool, 2025-05-07T20:32:53.5390841Z ) -> None: 2025-05-07T20:32:53.5391070Z torch.manual_seed(2025) 2025-05-07T20:32:53.5391316Z 2025-05-07T20:32:53.5391602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.5391956Z 2025-05-07T20:32:53.5392155Z > x_sign = torch.sign(x) 2025-05-07T20:32:53.5394146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.5396020Z 2025-05-07T20:32:53.5396142Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:53.5396363Z 2025-05-07T20:32:53.5396500Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.5397072Z self=, 2025-05-07T20:32:53.5397604Z T=16384, 2025-05-07T20:32:53.5397803Z D=5120, 2025-05-07T20:32:53.5398000Z scale_ub=None, 2025-05-07T20:32:53.5398209Z contiguous=True, 2025-05-07T20:32:53.5398497Z compiled=False, 2025-05-07T20:32:53.5398698Z ) 2025-05-07T20:32:53.6160714Z self = 2025-05-07T20:32:53.6161870Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.6162427Z 2025-05-07T20:32:53.6162596Z @given( 2025-05-07T20:32:53.6162848Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6163162Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6163470Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6163890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6164218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6164509Z ) 2025-05-07T20:32:53.6164861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6165309Z def test_silu_mul_quant( 2025-05-07T20:32:53.6165569Z self, 2025-05-07T20:32:53.6165771Z T: int, 2025-05-07T20:32:53.6165975Z D: int, 2025-05-07T20:32:53.6166197Z scale_ub: Optional[float], 2025-05-07T20:32:53.6166469Z contiguous: bool, 2025-05-07T20:32:53.6166786Z compiled: bool, 2025-05-07T20:32:53.6167020Z ) -> None: 2025-05-07T20:32:53.6167239Z torch.manual_seed(2025) 2025-05-07T20:32:53.6167482Z 2025-05-07T20:32:53.6167763Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6169825Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6171690Z 2025-05-07T20:32:53.6171812Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6172025Z 2025-05-07T20:32:53.6172134Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6172548Z self=, 2025-05-07T20:32:53.6173094Z T=4096, 2025-05-07T20:32:53.6173287Z D=5120, 2025-05-07T20:32:53.6173476Z scale_ub=None, 2025-05-07T20:32:53.6173697Z contiguous=True, 2025-05-07T20:32:53.6173922Z compiled=False, 2025-05-07T20:32:53.6174125Z ) 2025-05-07T20:32:53.6174442Z self = 2025-05-07T20:32:53.6174928Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:53.6175195Z 2025-05-07T20:32:53.6175282Z @given( 2025-05-07T20:32:53.6175509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6175825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6176128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6176451Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6176778Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6177067Z ) 2025-05-07T20:32:53.6177414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6177852Z def test_silu_mul_quant( 2025-05-07T20:32:53.6178091Z self, 2025-05-07T20:32:53.6178297Z T: int, 2025-05-07T20:32:53.6178500Z D: int, 2025-05-07T20:32:53.6178719Z scale_ub: Optional[float], 2025-05-07T20:32:53.6178990Z contiguous: bool, 2025-05-07T20:32:53.6179226Z compiled: bool, 2025-05-07T20:32:53.6179448Z ) -> None: 2025-05-07T20:32:53.6179668Z torch.manual_seed(2025) 2025-05-07T20:32:53.6179914Z 2025-05-07T20:32:53.6180188Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6182324Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
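Note how the reported "memory in use" creeps from 21.92 GiB to 22.04 GiB across successive examples: Hypothesis runs all of its examples inside a single test invocation, and tensors from a failed example can remain reachable through the captured traceback when the next example starts. A defensive per-example cleanup, sketched here as an assumption rather than anything the original test does:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached
        # allocator blocks so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()

Calling this at the top of the test body, before the torch.randn allocation, would keep one example's leftovers from starving the next.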
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6184185Z 2025-05-07T20:32:53.6184311Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6184524Z 2025-05-07T20:32:53.6184636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6185043Z self=, 2025-05-07T20:32:53.6185445Z T=2048, 2025-05-07T20:32:53.6185634Z D=5120, 2025-05-07T20:32:53.6185823Z scale_ub=None, 2025-05-07T20:32:53.6186041Z contiguous=False, 2025-05-07T20:32:53.6186274Z compiled=False, 2025-05-07T20:32:53.6186476Z ) 2025-05-07T20:32:53.6186839Z self = 2025-05-07T20:32:53.6187337Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.6187608Z 2025-05-07T20:32:53.6187687Z @given( 2025-05-07T20:32:53.6187921Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6188233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6188544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6188869Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6189201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6189500Z ) 2025-05-07T20:32:53.6189848Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6190295Z def test_silu_mul_quant( 2025-05-07T20:32:53.6190550Z self, 2025-05-07T20:32:53.6190743Z T: int, 2025-05-07T20:32:53.6190947Z D: int, 2025-05-07T20:32:53.6191172Z scale_ub: Optional[float], 2025-05-07T20:32:53.6191443Z contiguous: bool, 2025-05-07T20:32:53.6191687Z compiled: bool, 2025-05-07T20:32:53.6191913Z ) -> None: 2025-05-07T20:32:53.6192126Z torch.manual_seed(2025) 2025-05-07T20:32:53.6192366Z 2025-05-07T20:32:53.6192637Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6194641Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6196474Z 2025-05-07T20:32:53.6196605Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6196814Z 2025-05-07T20:32:53.6196916Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6197333Z self=, 2025-05-07T20:32:53.6197730Z T=4096, 2025-05-07T20:32:53.6197913Z D=7168, 2025-05-07T20:32:53.6198112Z scale_ub=None, 2025-05-07T20:32:53.6198334Z contiguous=True, 2025-05-07T20:32:53.6198551Z compiled=True, 2025-05-07T20:32:53.6198763Z ) 2025-05-07T20:32:53.6199083Z self = 2025-05-07T20:32:53.6199571Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.6199836Z 2025-05-07T20:32:53.6199916Z @given( 2025-05-07T20:32:53.6200202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6200516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6200897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6201233Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6201561Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6201843Z ) 2025-05-07T20:32:53.6202186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6202620Z def test_silu_mul_quant( 2025-05-07T20:32:53.6202857Z self, 2025-05-07T20:32:53.6203092Z T: int, 2025-05-07T20:32:53.6203294Z D: int, 2025-05-07T20:32:53.6203518Z scale_ub: Optional[float], 2025-05-07T20:32:53.6203787Z contiguous: bool, 2025-05-07T20:32:53.6204037Z compiled: bool, 2025-05-07T20:32:53.6204262Z ) -> None: 2025-05-07T20:32:53.6204477Z torch.manual_seed(2025) 2025-05-07T20:32:53.6204725Z 2025-05-07T20:32:53.6205009Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6207058Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6208883Z 2025-05-07T20:32:53.6209006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6209224Z 2025-05-07T20:32:53.6209328Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6209754Z self=, 2025-05-07T20:32:53.6210162Z T=2048, 2025-05-07T20:32:53.6210351Z D=5120, 2025-05-07T20:32:53.6210556Z scale_ub=1200.0, 2025-05-07T20:32:53.6210791Z contiguous=False, 2025-05-07T20:32:53.6211025Z compiled=False, 2025-05-07T20:32:53.6211235Z ) 2025-05-07T20:32:53.6211571Z self = 2025-05-07T20:32:53.6212059Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.6212339Z 2025-05-07T20:32:53.6212424Z @given( 2025-05-07T20:32:53.6212655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.6213026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.6213328Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.6213655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.6213985Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.6214264Z ) 2025-05-07T20:32:53.6214616Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.6215056Z def test_silu_mul_quant( 2025-05-07T20:32:53.6215289Z self, 2025-05-07T20:32:53.6215497Z T: int, 2025-05-07T20:32:53.6215698Z D: int, 2025-05-07T20:32:53.6215914Z scale_ub: Optional[float], 2025-05-07T20:32:53.6216187Z contiguous: bool, 2025-05-07T20:32:53.6216434Z compiled: bool, 2025-05-07T20:32:53.6216653Z ) -> None: 2025-05-07T20:32:53.6216870Z torch.manual_seed(2025) 2025-05-07T20:32:53.6217116Z 2025-05-07T20:32:53.6217391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.6219445Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.6221298Z 2025-05-07T20:32:53.6221419Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.6221638Z 2025-05-07T20:32:53.6221746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.6222159Z self=, 2025-05-07T20:32:53.6222561Z T=4096, 2025-05-07T20:32:53.6222798Z D=7168, 2025-05-07T20:32:53.6222993Z scale_ub=1200.0, 2025-05-07T20:32:53.6223212Z contiguous=True, 2025-05-07T20:32:53.6223437Z compiled=False, 2025-05-07T20:32:53.6223655Z ) 2025-05-07T20:32:53.7302495Z self = 2025-05-07T20:32:53.7303248Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.7303547Z 2025-05-07T20:32:53.7303639Z @given( 2025-05-07T20:32:53.7303876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.7304464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.7304791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.7305123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.7305464Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.7305767Z ) 2025-05-07T20:32:53.7306123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.7306594Z def test_silu_mul_quant( 2025-05-07T20:32:53.7306858Z self, 2025-05-07T20:32:53.7307065Z T: int, 2025-05-07T20:32:53.7307282Z D: int, 2025-05-07T20:32:53.7307511Z scale_ub: Optional[float], 2025-05-07T20:32:53.7307793Z contiguous: bool, 2025-05-07T20:32:53.7308048Z compiled: bool, 2025-05-07T20:32:53.7308283Z ) -> None: 2025-05-07T20:32:53.7308514Z torch.manual_seed(2025) 2025-05-07T20:32:53.7308767Z 2025-05-07T20:32:53.7309054Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.7311109Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The next four generated examples fail identically, at the same allocation in moe/activation_test.py:92, with the full test body reprinted each time; condensed:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True): tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False): tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False): tried to allocate 448.00 MiB

In each case: torch.OutOfMemoryError: CUDA out of memory. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.7371982Z 2025-05-07T20:32:53.7372103Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.7372315Z 2025-05-07T20:32:53.7372436Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.7372928Z self=, 2025-05-07T20:32:53.7373414Z T=128, 2025-05-07T20:32:53.7373609Z D=5120, 2025-05-07T20:32:53.7373806Z scale_ub=1200.0, 2025-05-07T20:32:53.7374034Z contiguous=False, 2025-05-07T20:32:53.7374270Z compiled=False, 2025-05-07T20:32:53.7374481Z ) 2025-05-07T20:32:53.8662879Z self = 2025-05-07T20:32:53.8663473Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:53.8663759Z 2025-05-07T20:32:53.8663840Z @given( 2025-05-07T20:32:53.8664084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8664398Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8664716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8665060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8665392Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8665695Z ) 2025-05-07T20:32:53.8666060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8666514Z def test_silu_mul_quant( 2025-05-07T20:32:53.8666756Z self, 2025-05-07T20:32:53.8666964Z T: int, 2025-05-07T20:32:53.8667165Z D: int, 2025-05-07T20:32:53.8667388Z scale_ub: Optional[float], 2025-05-07T20:32:53.8667662Z contiguous: bool, 2025-05-07T20:32:53.8667906Z compiled: bool, 2025-05-07T20:32:53.8668129Z ) -> None: 2025-05-07T20:32:53.8668347Z torch.manual_seed(2025) 2025-05-07T20:32:53.8668592Z 2025-05-07T20:32:53.8668863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8669213Z 2025-05-07T20:32:53.8669415Z x_sign = torch.sign(x) 2025-05-07T20:32:53.8669705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.8670019Z x = x_sign * x_clamp 2025-05-07T20:32:53.8670267Z x0 = x[:, :D] 2025-05-07T20:32:53.8670489Z x1 = x[:, D:] 2025-05-07T20:32:53.8670707Z 2025-05-07T20:32:53.8670899Z if contiguous: 2025-05-07T20:32:53.8671137Z x0 = x0.contiguous() 2025-05-07T20:32:53.8671404Z x1 = x1.contiguous() 2025-05-07T20:32:53.8671658Z 2025-05-07T20:32:53.8671861Z if scale_ub is not None: 2025-05-07T20:32:53.8672135Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.8672487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.8672805Z ) 2025-05-07T20:32:53.8673000Z else: 2025-05-07T20:32:53.8673224Z scale_ub_tensor = None 2025-05-07T20:32:53.8673485Z 2025-05-07T20:32:53.8673719Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.8674041Z op = silu_mul_quant 2025-05-07T20:32:53.8674575Z if compiled: 2025-05-07T20:32:53.8674829Z op = torch.compile(op) 2025-05-07T20:32:53.8675214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.8675502Z 2025-05-07T20:32:53.8675699Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.8675863Z 2025-05-07T20:32:53.8675965Z moe/activation_test.py:117: 2025-05-07T20:32:53.8676265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8676597Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.8676873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.8677646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.8678331Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.8678875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.8679548Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.8680283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.8680817Z kernel = self.compile( 2025-05-07T20:32:53.8681364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.8682014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.8682413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8682641Z 2025-05-07T20:32:53.8682859Z self = 2025-05-07T20:32:53.8683936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.8685305Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e8247c0>} 2025-05-07T20:32:53.8686636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.8687649Z context = 2025-05-07T20:32:53.8687935Z 2025-05-07T20:32:53.8688108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.8688623Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.8689093Z module_map=module_map) 2025-05-07T20:32:53.8689459Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.8689812Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.8690070Z E ^ 2025-05-07T20:32:53.8690536Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.8690989Z 2025-05-07T20:32:53.8691407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.8691912Z 2025-05-07T20:32:53.8692020Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8692439Z self=, 2025-05-07T20:32:53.8692865Z T=2048, 2025-05-07T20:32:53.8693220Z D=7168, 2025-05-07T20:32:53.8693413Z scale_ub=None, 2025-05-07T20:32:53.8693633Z contiguous=False, 2025-05-07T20:32:53.8693859Z compiled=False, 2025-05-07T20:32:53.8694070Z ) 2025-05-07T20:32:53.8694395Z self = 2025-05-07T20:32:53.8694890Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.8695242Z 2025-05-07T20:32:53.8695321Z @given( 2025-05-07T20:32:53.8695599Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.8695920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.8696226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.8696562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.8696903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.8697191Z ) 2025-05-07T20:32:53.8697547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.8698039Z def test_silu_mul_quant( 2025-05-07T20:32:53.8698291Z self, 2025-05-07T20:32:53.8698498Z T: int, 2025-05-07T20:32:53.8698705Z D: int, 2025-05-07T20:32:53.8698936Z scale_ub: Optional[float], 2025-05-07T20:32:53.8699205Z contiguous: bool, 2025-05-07T20:32:53.8699452Z compiled: bool, 2025-05-07T20:32:53.8699684Z ) -> None: 2025-05-07T20:32:53.8699901Z torch.manual_seed(2025) 2025-05-07T20:32:53.8700150Z 2025-05-07T20:32:53.8700477Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.8702514Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
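The CompilationError above is the second, distinct failure mode: Triton's fp8e4nv is the NVIDIA FP8 E4M3 encoding (the one torch.float8_e4m3fn corresponds to), and Triton only emits it on GPUs with compute capability 8.9 or newer (Ada/Hopper). The 22.07 GiB capacity reported above is consistent with a 24 GB A10G, which is SM 8.6, so only 'fp8e4b15' and 'fp8e5' are available and the kernel cannot be compiled at all. A sketch of one possible capability gate (names are illustrative; the suite may guard differently):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (E4M3) needs SM 8.9+; the A10G is SM 8.6.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...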
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.8704402Z 2025-05-07T20:32:53.8704524Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:53.8704748Z 2025-05-07T20:32:53.8704853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.8705273Z self=, 2025-05-07T20:32:53.8705681Z T=128, 2025-05-07T20:32:53.8705876Z D=7168, 2025-05-07T20:32:53.8706084Z scale_ub=1200.0, 2025-05-07T20:32:53.8706319Z contiguous=True, 2025-05-07T20:32:53.8706548Z compiled=True, 2025-05-07T20:32:53.8706764Z ) 2025-05-07T20:32:53.9019956Z self = 2025-05-07T20:32:53.9020616Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.9020885Z 2025-05-07T20:32:53.9020987Z @given( 2025-05-07T20:32:53.9021216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.9021530Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.9021843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.9022168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.9022507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.9022841Z ) 2025-05-07T20:32:53.9023222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.9023679Z def test_silu_mul_quant( 2025-05-07T20:32:53.9023938Z self, 2025-05-07T20:32:53.9024145Z T: int, 2025-05-07T20:32:53.9024345Z D: int, 2025-05-07T20:32:53.9024575Z scale_ub: Optional[float], 2025-05-07T20:32:53.9024861Z contiguous: bool, 2025-05-07T20:32:53.9025107Z compiled: bool, 2025-05-07T20:32:53.9025345Z ) -> None: 2025-05-07T20:32:53.9025575Z torch.manual_seed(2025) 2025-05-07T20:32:53.9025820Z 2025-05-07T20:32:53.9026101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.9026449Z 2025-05-07T20:32:53.9026645Z x_sign = torch.sign(x) 2025-05-07T20:32:53.9026941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.9027268Z x = x_sign * x_clamp 2025-05-07T20:32:53.9027682Z x0 = x[:, :D] 2025-05-07T20:32:53.9027914Z x1 = x[:, D:] 2025-05-07T20:32:53.9028132Z 2025-05-07T20:32:53.9028317Z if contiguous: 2025-05-07T20:32:53.9028634Z x0 = x0.contiguous() 2025-05-07T20:32:53.9028900Z x1 = x1.contiguous() 2025-05-07T20:32:53.9029141Z 2025-05-07T20:32:53.9029333Z if scale_ub is not None: 2025-05-07T20:32:53.9029612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.9029949Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.9030259Z ) 2025-05-07T20:32:53.9030529Z else: 2025-05-07T20:32:53.9030749Z scale_ub_tensor = None 2025-05-07T20:32:53.9031001Z 2025-05-07T20:32:53.9031241Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.9031556Z op = silu_mul_quant 2025-05-07T20:32:53.9031808Z if compiled: 2025-05-07T20:32:53.9032068Z op = torch.compile(op) 2025-05-07T20:32:53.9032375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.9032648Z 2025-05-07T20:32:53.9032857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.9033030Z 2025-05-07T20:32:53.9033200Z moe/activation_test.py:117: 2025-05-07T20:32:53.9033502Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.9033833Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.9034120Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.9034684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.9035243Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.9035902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.9036585Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.9037123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.9037798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.9038472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.9039005Z kernel = self.compile( 2025-05-07T20:32:53.9039548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.9040203Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.9040607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.9040837Z 2025-05-07T20:32:53.9041053Z self = 2025-05-07T20:32:53.9042118Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.9043488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e825940>} 2025-05-07T20:32:53.9044811Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.9045823Z context = 2025-05-07T20:32:53.9046111Z 2025-05-07T20:32:53.9046285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.9046803Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.9047274Z module_map=module_map) 2025-05-07T20:32:53.9047648Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.9048051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.9048316Z E ^ 2025-05-07T20:32:53.9048826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.9049272Z 2025-05-07T20:32:53.9049693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.9050196Z 2025-05-07T20:32:53.9050301Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.9050716Z self=, 2025-05-07T20:32:53.9051156Z T=128, 2025-05-07T20:32:53.9051343Z D=7168, 2025-05-07T20:32:53.9051546Z scale_ub=1200.0, 2025-05-07T20:32:53.9051774Z contiguous=True, 2025-05-07T20:32:53.9051995Z compiled=False, 2025-05-07T20:32:53.9052209Z ) 2025-05-07T20:32:53.9052534Z self = 2025-05-07T20:32:53.9053153Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:53.9053420Z 2025-05-07T20:32:53.9053498Z @given( 2025-05-07T20:32:53.9053779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.9054099Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.9054403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.9054735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.9055069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.9055355Z ) 2025-05-07T20:32:53.9055710Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.9056154Z def test_silu_mul_quant( 2025-05-07T20:32:53.9056401Z self, 2025-05-07T20:32:53.9056594Z T: int, 2025-05-07T20:32:53.9056801Z D: int, 2025-05-07T20:32:53.9057026Z scale_ub: Optional[float], 2025-05-07T20:32:53.9057296Z contiguous: bool, 2025-05-07T20:32:53.9057545Z compiled: bool, 2025-05-07T20:32:53.9057769Z ) -> None: 2025-05-07T20:32:53.9057984Z torch.manual_seed(2025) 2025-05-07T20:32:53.9058233Z 2025-05-07T20:32:53.9058516Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.9058859Z 2025-05-07T20:32:53.9059065Z x_sign = torch.sign(x) 2025-05-07T20:32:53.9059684Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.9061681Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
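Note how the headroom shrinks as Hypothesis iterates: earlier examples fail with 26.44 MiB free, the ones below with 4.44 MiB, even though the T=128 examples need only tens of MiB. That points to allocations accumulating across generated examples rather than any single example being too large. One mitigation, sketched here as an assumption rather than the suite's actual fix, is to return cached blocks to the driver at the start of each example:

    import gc

    import torch

    def free_cuda() -> None:
        # Drop dead references, then release cached allocator blocks.
        gc.collect()
        torch.cuda.empty_cache()

Calling such a helper at the top of the test body runs once per generated example; unittest's setUp runs only once per test method, not per Hypothesis example, which is why per-example cleanup has to live inside the test (Hypothesis also provides setup_example/teardown_example hooks on TestCase subclasses for this purpose).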
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.9063512Z 2025-05-07T20:32:53.9063644Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.9063858Z 2025-05-07T20:32:53.9063973Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.9064393Z self=, 2025-05-07T20:32:53.9064798Z T=128, 2025-05-07T20:32:53.9064984Z D=5120, 2025-05-07T20:32:53.9065183Z scale_ub=1200.0, 2025-05-07T20:32:53.9065415Z contiguous=True, 2025-05-07T20:32:53.9065640Z compiled=True, 2025-05-07T20:32:53.9065857Z ) 2025-05-07T20:32:53.9066179Z self = 2025-05-07T20:32:53.9066663Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:53.9066933Z 2025-05-07T20:32:53.9067011Z @given( 2025-05-07T20:32:53.9067242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.9067558Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.9067954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.9068354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.9068690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.9068973Z ) 2025-05-07T20:32:53.9069325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.9069768Z def test_silu_mul_quant( 2025-05-07T20:32:53.9070012Z self, 2025-05-07T20:32:53.9070210Z T: int, 2025-05-07T20:32:53.9070409Z D: int, 2025-05-07T20:32:53.9070722Z scale_ub: Optional[float], 2025-05-07T20:32:53.9070998Z contiguous: bool, 2025-05-07T20:32:53.9071244Z compiled: bool, 2025-05-07T20:32:53.9071469Z ) -> None: 2025-05-07T20:32:53.9071692Z torch.manual_seed(2025) 2025-05-07T20:32:53.9071937Z 2025-05-07T20:32:53.9072212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.9072554Z 2025-05-07T20:32:53.9072756Z x_sign = torch.sign(x) 2025-05-07T20:32:53.9073055Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.9075111Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:53.9076936Z 2025-05-07T20:32:53.9077059Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:53.9077278Z 2025-05-07T20:32:53.9077382Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.9077794Z self=, 2025-05-07T20:32:53.9078198Z T=128, 2025-05-07T20:32:53.9078384Z D=7168, 2025-05-07T20:32:53.9078587Z scale_ub=None, 2025-05-07T20:32:53.9078810Z contiguous=True, 2025-05-07T20:32:53.9079031Z compiled=True, 2025-05-07T20:32:53.9079240Z ) 2025-05-07T20:32:54.1574812Z self = 2025-05-07T20:32:54.1575360Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1575629Z 2025-05-07T20:32:54.1575722Z @given( 2025-05-07T20:32:54.1575977Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1576300Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1576612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1576940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1577270Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1577559Z ) 2025-05-07T20:32:54.1577918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1578369Z def test_silu_mul_quant( 2025-05-07T20:32:54.1578635Z self, 2025-05-07T20:32:54.1578838Z T: int, 2025-05-07T20:32:54.1579034Z D: int, 2025-05-07T20:32:54.1579259Z scale_ub: Optional[float], 2025-05-07T20:32:54.1579539Z contiguous: bool, 2025-05-07T20:32:54.1579781Z compiled: bool, 2025-05-07T20:32:54.1580023Z ) -> None: 2025-05-07T20:32:54.1580247Z torch.manual_seed(2025) 2025-05-07T20:32:54.1580498Z 2025-05-07T20:32:54.1580788Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1583111Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
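For orientation before the failure summary below: the test checks silu_mul_quant against a float32 reference that computes a SiLU-gated multiply, y = x0 * sigmoid(x0) * x1, then quantizes each row to FP8 with a per-row scale, so that y_fp8.to(torch.float32) * y_scale[:, None] recovers y. A minimal pure-PyTorch sketch of that contract (assuming torch.float8_e4m3fn as the FP8 dtype and ignoring the optional scale_ub clamp; the real kernels are Triton):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 in float32, mirroring the test's ref_fn.
        x0f, x1f = x0.to(torch.float32), x1.to(torch.float32)
        return x0f * torch.sigmoid(x0f) * x1f

    def quantize_fp8_row(y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale so each row's max magnitude maps to the FP8 max (448).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = y.abs().amax(dim=1).clamp(min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    y = silu_mul_ref(torch.randn(4, 8), torch.randn(4, 8))
    y_fp8, y_scale = quantize_fp8_row(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]  # approximates y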
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1585131Z 2025-05-07T20:32:54.1585256Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.1585480Z 2025-05-07T20:32:54.1600295Z FAILED 2025-05-07T20:32:54.1600539Z 2025-05-07T20:32:54.1600780Z =================================== FAILURES =================================== 2025-05-07T20:32:54.1601625Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:54.1602247Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:54.1603031Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:54.1603585Z | yield 2025-05-07T20:32:54.1604130Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:54.1604857Z | self._callTestMethod(testMethod) 2025-05-07T20:32:54.1605868Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:54.1606697Z | if method() is not None: 2025-05-07T20:32:54.1607043Z | ^^^^^^^^ 2025-05-07T20:32:54.1607982Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:54.1609267Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1609694Z | ^^^^^^^ 2025-05-07T20:32:54.1610512Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:54.1611424Z | raise the_error_hypothesis_found 2025-05-07T20:32:54.1612029Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:54.1612649Z +-+---------------- 1 ---------------- 2025-05-07T20:32:54.1613253Z | Traceback (most recent call last): 2025-05-07T20:32:54.1614284Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.1615413Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1615954Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1618867Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
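Hypothesis reports the four distinct failures as a single PEP 654 ExceptionGroup (raised from hypothesis/core.py above); on Python 3.11+ the sub-exceptions can be split by type with except*. A self-contained sketch, where run_test is a hypothetical stand-in for the failing suite:

    import torch

    def run_test() -> None:
        # Stand-in that fails the way this log does: a group mixing OOMs
        # with Triton compilation failures (ValueError here stands in for
        # triton.compiler.errors.CompilationError).
        raise ExceptionGroup(
            "Hypothesis found 2 distinct failures",
            [
                torch.OutOfMemoryError("CUDA out of memory"),
                ValueError("type fp8e4nv not supported in this architecture"),
            ],
        )

    try:
        run_test()
    except* torch.OutOfMemoryError as eg:
        for exc in eg.exceptions:
            print("OOM sub-failure:", exc)
    except* ValueError as eg:
        print(len(eg.exceptions), "compilation-style sub-failure(s)")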
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1621791Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1622427Z | self=, 2025-05-07T20:32:54.1623022Z | T=2048, 2025-05-07T20:32:54.1623363Z | D=5120, # or any other generated value 2025-05-07T20:32:54.1623851Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:54.1624383Z | contiguous=True, # or any other generated value 2025-05-07T20:32:54.1624915Z | compiled=False, # or any other generated value 2025-05-07T20:32:54.1625352Z | ) 2025-05-07T20:32:54.1625599Z | 2025-05-07T20:32:54.1626364Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:54.1627261Z +---------------- 2 ---------------- 2025-05-07T20:32:54.1627754Z | Traceback (most recent call last): 2025-05-07T20:32:54.1628841Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.1629986Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1630533Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1633874Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1637688Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1638544Z | self=, 2025-05-07T20:32:54.1639259Z | T=128, 2025-05-07T20:32:54.1639603Z | D=7168, 2025-05-07T20:32:54.1639959Z | scale_ub=None, 2025-05-07T20:32:54.1640361Z | contiguous=True, 2025-05-07T20:32:54.1640763Z | compiled=True, 2025-05-07T20:32:54.1641135Z | ) 2025-05-07T20:32:54.1641449Z | 2025-05-07T20:32:54.1642382Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.1643504Z +---------------- 3 ---------------- 2025-05-07T20:32:54.1643980Z | Traceback (most recent call last): 2025-05-07T20:32:54.1645125Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.1646411Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1647004Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1650120Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
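Each sub-failure ends with a @reproduce_failure line; temporarily copying it onto the test replays exactly that falsifying example instead of re-running the whole search. For the first failure above it would look like the following sketch (version string and blob verbatim from this log; the body is elided):

    from typing import Optional

    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # same body as shown in the log

The blob is tied to Hypothesis 6.131.14, and the decorator is meant to be removed once the failure is fixed.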
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.1653183Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1653828Z | self=, 2025-05-07T20:32:54.1654418Z | T=128, 2025-05-07T20:32:54.1654709Z | D=5120, 2025-05-07T20:32:54.1655019Z | scale_ub=1200.0, 2025-05-07T20:32:54.1655294Z | contiguous=True, 2025-05-07T20:32:54.1655551Z | compiled=True, 2025-05-07T20:32:54.1655787Z | ) 2025-05-07T20:32:54.1655982Z | 2025-05-07T20:32:54.1656516Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.1657118Z +---------------- 4 ---------------- 2025-05-07T20:32:54.1657421Z | Traceback (most recent call last): 2025-05-07T20:32:54.1658140Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:54.1658849Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.1659142Z | ^^^^^^^^ 2025-05-07T20:32:54.1660263Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:54.1660957Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1661358Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1662227Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:54.1663007Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1663692Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:54.1664406Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1664842Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1665477Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:54.1666299Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1666754Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1667380Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:54.1668064Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1668430Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1669019Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:54.1669574Z | fn() 2025-05-07T20:32:54.1670133Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:54.1670747Z | self.fn.run( 2025-05-07T20:32:54.1671270Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:54.1671837Z | kernel = self.compile( 2025-05-07T20:32:54.1672091Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:54.1672681Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:54.1673426Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1673819Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1674443Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:54.1675219Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1675695Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.1676067Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1676418Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.1676680Z | ^ 2025-05-07T20:32:54.1677133Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1677685Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.1678086Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:54.1678592Z | self=, 2025-05-07T20:32:54.1679018Z | T=1, # or any other generated value 2025-05-07T20:32:54.1679329Z | D=5120, # or any other generated value 2025-05-07T20:32:54.1679737Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:54.1680095Z | contiguous=True, # or any other generated value 2025-05-07T20:32:54.1680495Z | compiled=True, # or any other generated value 2025-05-07T20:32:54.1680796Z | ) 2025-05-07T20:32:54.1680975Z | 2025-05-07T20:32:54.1681490Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.1682123Z +------------------------------------ 2025-05-07T20:32:54.1682631Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:54.1683214Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1683789Z self=, 2025-05-07T20:32:54.1684340Z T=1, 2025-05-07T20:32:54.1684602Z D=5120, 2025-05-07T20:32:54.1684864Z scale_ub=None, 2025-05-07T20:32:54.1685166Z contiguous=True, 2025-05-07T20:32:54.1685489Z compiled=True, 2025-05-07T20:32:54.1685779Z ) 2025-05-07T20:32:54.1686222Z self = 2025-05-07T20:32:54.1686940Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1687300Z 2025-05-07T20:32:54.1687413Z @given( 2025-05-07T20:32:54.1687736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1688174Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1688596Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1689064Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1689529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1689933Z ) 2025-05-07T20:32:54.1690416Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1691032Z def test_silu_mul_quant( 2025-05-07T20:32:54.1691379Z self, 2025-05-07T20:32:54.1810281Z T: int, 2025-05-07T20:32:54.1810597Z D: int, 2025-05-07T20:32:54.1810901Z scale_ub: Optional[float], 2025-05-07T20:32:54.1811306Z contiguous: bool, 2025-05-07T20:32:54.1811657Z compiled: bool, 2025-05-07T20:32:54.1811971Z ) -> None: 2025-05-07T20:32:54.1812251Z torch.manual_seed(2025) 2025-05-07T20:32:54.1812569Z 2025-05-07T20:32:54.1812931Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1813461Z 2025-05-07T20:32:54.1813714Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1814110Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1814518Z x = x_sign * x_clamp 2025-05-07T20:32:54.1814839Z x0 = x[:, :D] 2025-05-07T20:32:54.1815133Z x1 = x[:, D:] 2025-05-07T20:32:54.1815414Z 2025-05-07T20:32:54.1815658Z if contiguous: 2025-05-07T20:32:54.1815969Z x0 = x0.contiguous() 2025-05-07T20:32:54.1816316Z x1 = x1.contiguous() 2025-05-07T20:32:54.1816635Z 2025-05-07T20:32:54.1816892Z if scale_ub is not None: 2025-05-07T20:32:54.1817260Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1817707Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1818116Z ) 2025-05-07T20:32:54.1818377Z else: 2025-05-07T20:32:54.1818652Z scale_ub_tensor = None 2025-05-07T20:32:54.1818996Z 2025-05-07T20:32:54.1819322Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1819762Z op = silu_mul_quant 2025-05-07T20:32:54.1820119Z if compiled: 2025-05-07T20:32:54.1820464Z op = torch.compile(op) 2025-05-07T20:32:54.1820867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1821258Z 2025-05-07T20:32:54.1821522Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.1821916Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.1822296Z 2025-05-07T20:32:54.1822912Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1823262Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.1823682Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.1824000Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.1824357Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1824659Z 2025-05-07T20:32:54.1824860Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.1825052Z 2025-05-07T20:32:54.1825159Z moe/activation_test.py:126: 2025-05-07T20:32:54.1825544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1825879Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.1826205Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1826995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.1827739Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1828368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1829048Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1829731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.1830443Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1831177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.1831810Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1832399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.1832912Z fn() 2025-05-07T20:32:54.1833459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.1834046Z self.fn.run( 2025-05-07T20:32:54.1834510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1835043Z kernel = self.compile( 2025-05-07T20:32:54.1835582Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1836221Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1836629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1836863Z 2025-05-07T20:32:54.1837069Z self = 2025-05-07T20:32:54.1838144Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1839542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc89114dc60>} 2025-05-07T20:32:54.1840864Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1841882Z context = 2025-05-07T20:32:54.1842176Z 2025-05-07T20:32:54.1842340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1842860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1843318Z module_map=module_map) 2025-05-07T20:32:54.1843693Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1844099Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.1844359Z E ^ 2025-05-07T20:32:54.1844872Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1845320Z 2025-05-07T20:32:54.1845733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1846235Z 2025-05-07T20:32:54.1846343Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1846792Z self=, 2025-05-07T20:32:54.1847185Z T=2048, 2025-05-07T20:32:54.1847373Z D=5120, 2025-05-07T20:32:54.1847560Z scale_ub=1200.0, 2025-05-07T20:32:54.1847784Z contiguous=True, 2025-05-07T20:32:54.1848002Z compiled=False, 2025-05-07T20:32:54.1848207Z ) 2025-05-07T20:32:54.1848521Z self = 2025-05-07T20:32:54.1849011Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.1849280Z 2025-05-07T20:32:54.1849409Z @given( 2025-05-07T20:32:54.1849638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1849948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1850253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1850573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1850903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1851197Z ) 2025-05-07T20:32:54.1851544Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1851986Z def test_silu_mul_quant( 2025-05-07T20:32:54.1852229Z self, 2025-05-07T20:32:54.1852417Z T: int, 2025-05-07T20:32:54.1852614Z D: int, 2025-05-07T20:32:54.1852838Z scale_ub: Optional[float], 2025-05-07T20:32:54.1853240Z contiguous: bool, 2025-05-07T20:32:54.1853475Z compiled: bool, 2025-05-07T20:32:54.1853698Z ) -> None: 2025-05-07T20:32:54.1853918Z torch.manual_seed(2025) 2025-05-07T20:32:54.1854159Z 2025-05-07T20:32:54.1854435Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1854777Z 2025-05-07T20:32:54.1854970Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1855264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1855575Z x = x_sign * x_clamp 2025-05-07T20:32:54.1855811Z x0 = x[:, :D] 
2025-05-07T20:32:54.1856036Z x1 = x[:, D:] 2025-05-07T20:32:54.1856248Z 2025-05-07T20:32:54.1856433Z if contiguous: 2025-05-07T20:32:54.1856669Z x0 = x0.contiguous() 2025-05-07T20:32:54.1856932Z x1 = x1.contiguous() 2025-05-07T20:32:54.1857170Z 2025-05-07T20:32:54.1857368Z if scale_ub is not None: 2025-05-07T20:32:54.1857648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1857983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1858295Z ) 2025-05-07T20:32:54.1858493Z else: 2025-05-07T20:32:54.1858709Z scale_ub_tensor = None 2025-05-07T20:32:54.1858959Z 2025-05-07T20:32:54.1859487Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1859853Z op = silu_mul_quant 2025-05-07T20:32:54.1860098Z if compiled: 2025-05-07T20:32:54.1860348Z op = torch.compile(op) 2025-05-07T20:32:54.1860647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1860921Z 2025-05-07T20:32:54.1861120Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.1861283Z 2025-05-07T20:32:54.1861389Z moe/activation_test.py:117: 2025-05-07T20:32:54.1861678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1862011Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.1862405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1863202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.1863878Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.1864413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1865085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1865738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1866329Z kernel = self.compile( 2025-05-07T20:32:54.1866867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1867509Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1867895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1868134Z 2025-05-07T20:32:54.1868341Z self = 2025-05-07T20:32:54.1869456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1870808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db0220>} 2025-05-07T20:32:54.1872120Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1873142Z context = 2025-05-07T20:32:54.1873437Z 2025-05-07T20:32:54.1873602Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1874120Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1874581Z module_map=module_map) 2025-05-07T20:32:54.1891963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1892437Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.1892696Z E ^ 2025-05-07T20:32:54.1893250Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1893710Z 2025-05-07T20:32:54.1894134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1894641Z 2025-05-07T20:32:54.1894755Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1895162Z self=, 2025-05-07T20:32:54.1895567Z T=2048, 2025-05-07T20:32:54.1895758Z D=5120, 2025-05-07T20:32:54.1895946Z scale_ub=1200.0, 2025-05-07T20:32:54.1896181Z contiguous=True, 2025-05-07T20:32:54.1896409Z compiled=True, 2025-05-07T20:32:54.1896616Z ) 2025-05-07T20:32:54.1896928Z self = 2025-05-07T20:32:54.1897413Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.1897680Z 2025-05-07T20:32:54.1897763Z @given( 2025-05-07T20:32:54.1897988Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1898307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1898614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1898940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1899267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1899557Z ) 2025-05-07T20:32:54.1899904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1900435Z def test_silu_mul_quant( 2025-05-07T20:32:54.1900719Z self, 2025-05-07T20:32:54.1900916Z T: int, 2025-05-07T20:32:54.1901113Z D: int, 2025-05-07T20:32:54.1901335Z scale_ub: Optional[float], 2025-05-07T20:32:54.1901609Z contiguous: bool, 2025-05-07T20:32:54.1901842Z compiled: bool, 2025-05-07T20:32:54.1902071Z ) -> None: 2025-05-07T20:32:54.1902285Z torch.manual_seed(2025) 2025-05-07T20:32:54.1902521Z 2025-05-07T20:32:54.1902793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1903183Z 2025-05-07T20:32:54.1903371Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1903659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1903968Z x = x_sign * x_clamp 2025-05-07T20:32:54.1904203Z x0 = x[:, :D] 2025-05-07T20:32:54.1904424Z x1 = x[:, D:] 2025-05-07T20:32:54.1904643Z 2025-05-07T20:32:54.1904826Z if contiguous: 2025-05-07T20:32:54.1905057Z x0 = x0.contiguous() 2025-05-07T20:32:54.1905359Z x1 = x1.contiguous() 2025-05-07T20:32:54.1905609Z 2025-05-07T20:32:54.1905797Z if scale_ub is not None: 2025-05-07T20:32:54.1906072Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1906411Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1906715Z ) 2025-05-07T20:32:54.1906903Z else: 2025-05-07T20:32:54.1907111Z scale_ub_tensor = None 2025-05-07T20:32:54.1907351Z 2025-05-07T20:32:54.1907573Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1907879Z op = silu_mul_quant 2025-05-07T20:32:54.1908119Z if compiled: 2025-05-07T20:32:54.1908373Z op = torch.compile(op) 2025-05-07T20:32:54.1908662Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1908925Z 2025-05-07T20:32:54.1909116Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.1909404Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.1909680Z 2025-05-07T20:32:54.1909923Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1910256Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.1910542Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.1910846Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.1911194Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1911507Z 2025-05-07T20:32:54.1911700Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.1911893Z 2025-05-07T20:32:54.1911995Z moe/activation_test.py:126: 2025-05-07T20:32:54.1912291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1912615Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.1912939Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1913721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.1914462Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1914998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1915668Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1916341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.1917047Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1917756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.1918386Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1919027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.1919575Z fn() 2025-05-07T20:32:54.1920072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.1920653Z self.fn.run( 2025-05-07T20:32:54.1921110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1921621Z kernel = self.compile( 2025-05-07T20:32:54.1922146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1922826Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1923211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1923440Z 2025-05-07T20:32:54.1923641Z self = 2025-05-07T20:32:54.1924749Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1926099Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890db16c0>} 2025-05-07T20:32:54.1927418Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1928419Z context = 2025-05-07T20:32:54.1928714Z 2025-05-07T20:32:54.1928876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1929388Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1929853Z module_map=module_map) 2025-05-07T20:32:54.1930212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1930565Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.1930831Z E ^ 2025-05-07T20:32:54.1931280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    [test source identical to the previous example]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
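Because the test runs with verbosity=Verbosity.verbose and deadline=None, Hypothesis keeps drawing new parameter combinations after each failure, and the identical CompilationError is reprinted for every draw below. A skip guard at the test level would collapse this output to a single skipped test; a hedged sketch using plain unittest (the class name and helper are illustrative, not FBGEMM's actual test scaffolding):

# Hedged sketch: skip fp8 tests on GPUs that cannot compile fp8e4nv, instead
# of letting each Hypothesis example fail with the same CompilationError.
# `Fp8ActivationTests` and `supports_fp8e4nv` are illustrative names.
import unittest
import torch

def supports_fp8e4nv() -> bool:
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv needs compute capability >= 8.9")
class Fp8ActivationTests(unittest.TestCase):
    def test_silu_mul_quant(self) -> None:
        ...  # the Hypothesis-driven body shown in the log would go here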
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.1962755Z 2025-05-07T20:32:54.1963169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.1963670Z 2025-05-07T20:32:54.1963776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.1964252Z self=, 2025-05-07T20:32:54.1964658Z T=1, 2025-05-07T20:32:54.1964843Z D=7168, 2025-05-07T20:32:54.1965042Z scale_ub=None, 2025-05-07T20:32:54.1965259Z contiguous=True, 2025-05-07T20:32:54.1965471Z compiled=True, 2025-05-07T20:32:54.1965689Z ) 2025-05-07T20:32:54.1966010Z self = 2025-05-07T20:32:54.1966486Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.1966749Z 2025-05-07T20:32:54.1966827Z @given( 2025-05-07T20:32:54.1967065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.1967378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.1967686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.1968014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.1968345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.1968626Z ) 2025-05-07T20:32:54.1968977Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.1969416Z def test_silu_mul_quant( 2025-05-07T20:32:54.1969652Z self, 2025-05-07T20:32:54.1969855Z T: int, 2025-05-07T20:32:54.1970057Z D: int, 2025-05-07T20:32:54.1970272Z scale_ub: Optional[float], 2025-05-07T20:32:54.1970551Z contiguous: bool, 2025-05-07T20:32:54.1970796Z compiled: bool, 2025-05-07T20:32:54.1971014Z ) -> None: 2025-05-07T20:32:54.1971234Z torch.manual_seed(2025) 2025-05-07T20:32:54.1971482Z 2025-05-07T20:32:54.1971749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.1972096Z 2025-05-07T20:32:54.1972294Z x_sign = torch.sign(x) 2025-05-07T20:32:54.1972585Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.1972892Z x = x_sign * x_clamp 2025-05-07T20:32:54.1973198Z x0 = x[:, :D] 2025-05-07T20:32:54.1973423Z x1 = x[:, D:] 2025-05-07T20:32:54.1973631Z 2025-05-07T20:32:54.1973825Z if contiguous: 2025-05-07T20:32:54.1974061Z x0 = x0.contiguous() 2025-05-07T20:32:54.1974317Z x1 = x1.contiguous() 2025-05-07T20:32:54.1974563Z 2025-05-07T20:32:54.1974768Z if scale_ub is not None: 2025-05-07T20:32:54.1975040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.1975382Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.1975700Z ) 2025-05-07T20:32:54.1975895Z else: 2025-05-07T20:32:54.1976121Z scale_ub_tensor = None 2025-05-07T20:32:54.1976376Z 2025-05-07T20:32:54.1976601Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1976918Z op = silu_mul_quant 2025-05-07T20:32:54.1977167Z if compiled: 2025-05-07T20:32:54.1977500Z op = torch.compile(op) 2025-05-07T20:32:54.1977785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.1978123Z 2025-05-07T20:32:54.1978317Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.1978598Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.1978886Z 2025-05-07T20:32:54.1979124Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.1979447Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.1979738Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.1980097Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.1980450Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1980762Z 2025-05-07T20:32:54.1980965Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.1981155Z 2025-05-07T20:32:54.1981260Z moe/activation_test.py:126: 2025-05-07T20:32:54.1981546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1981881Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.1982247Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.1983069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.1983815Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.1984355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.1985024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.1985697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.1986408Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.1987123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.1987749Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.1988337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.1988842Z fn() 2025-05-07T20:32:54.1989343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.1989908Z self.fn.run( 2025-05-07T20:32:54.1990368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.1990889Z kernel = self.compile( 2025-05-07T20:32:54.1991417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.1992055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.1992449Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.1992672Z 2025-05-07T20:32:54.1992884Z self = 2025-05-07T20:32:54.1993936Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.1995282Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd28e00>} 2025-05-07T20:32:54.1996599Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.1997602Z context = 2025-05-07T20:32:54.1997932Z 2025-05-07T20:32:54.1998101Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.1998648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.1999115Z module_map=module_map) 2025-05-07T20:32:54.1999478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.1999826Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.2000088Z E ^ 2025-05-07T20:32:54.2000545Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2001029Z 2025-05-07T20:32:54.2001440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2001940Z 2025-05-07T20:32:54.2002043Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2002448Z self=, 2025-05-07T20:32:54.2002841Z T=4096, 2025-05-07T20:32:54.2003019Z D=5120, 2025-05-07T20:32:54.2003213Z scale_ub=None, 2025-05-07T20:32:54.2003472Z contiguous=False, 2025-05-07T20:32:54.2003737Z compiled=False, 2025-05-07T20:32:54.2003944Z ) 2025-05-07T20:32:54.2004259Z self = 2025-05-07T20:32:54.2004747Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2005013Z 2025-05-07T20:32:54.2005088Z @given( 2025-05-07T20:32:54.2005322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2005630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2005924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2006249Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2006578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2006853Z ) 2025-05-07T20:32:54.2007198Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2007635Z def test_silu_mul_quant( 2025-05-07T20:32:54.2007878Z self, 2025-05-07T20:32:54.2008070Z T: int, 2025-05-07T20:32:54.2008268Z D: int, 2025-05-07T20:32:54.2008486Z scale_ub: Optional[float], 2025-05-07T20:32:54.2008746Z contiguous: bool, 2025-05-07T20:32:54.2008982Z compiled: bool, 2025-05-07T20:32:54.2009203Z ) -> None: 2025-05-07T20:32:54.2009410Z torch.manual_seed(2025) 2025-05-07T20:32:54.2009648Z 2025-05-07T20:32:54.2009919Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2010250Z 2025-05-07T20:32:54.2010451Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2010736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2011032Z x = x_sign * x_clamp 2025-05-07T20:32:54.2011274Z x0 = x[:, :D] 2025-05-07T20:32:54.2011490Z x1 = x[:, D:] 2025-05-07T20:32:54.2011696Z 2025-05-07T20:32:54.2011885Z if contiguous: 2025-05-07T20:32:54.2012119Z x0 = x0.contiguous() 2025-05-07T20:32:54.2012378Z x1 = x1.contiguous() 2025-05-07T20:32:54.2012608Z 2025-05-07T20:32:54.2012804Z if scale_ub is not None: 2025-05-07T20:32:54.2013179Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2013503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2013805Z ) 2025-05-07T20:32:54.2013997Z else: 2025-05-07T20:32:54.2014203Z scale_ub_tensor = None 2025-05-07T20:32:54.2014450Z 2025-05-07T20:32:54.2014681Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2014989Z op = silu_mul_quant 2025-05-07T20:32:54.2015246Z if compiled: 2025-05-07T20:32:54.2015492Z op = torch.compile(op) 2025-05-07T20:32:54.2015780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2016102Z 2025-05-07T20:32:54.2016298Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2016458Z 2025-05-07T20:32:54.2016604Z moe/activation_test.py:117: 2025-05-07T20:32:54.2016897Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2017229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2017508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2018180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2018907Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2019438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2020111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2020762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2021289Z kernel = self.compile( 2025-05-07T20:32:54.2021713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2021898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2022023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2022027Z 2025-05-07T20:32:54.2022235Z self = 2025-05-07T20:32:54.2022998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2023499Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc890f7f240>} 2025-05-07T20:32:54.2024240Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2024430Z context = 2025-05-07T20:32:54.2024435Z 2025-05-07T20:32:54.2024609Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2024867Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2024975Z module_map=module_map) 2025-05-07T20:32:54.2025142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2025240Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2025323Z E ^ 2025-05-07T20:32:54.2025675Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2025682Z 2025-05-07T20:32:54.2026091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2026098Z 2025-05-07T20:32:54.2026205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2026423Z self=, 2025-05-07T20:32:54.2026497Z T=4096, 2025-05-07T20:32:54.2026579Z D=7168, 2025-05-07T20:32:54.2026658Z scale_ub=None, 2025-05-07T20:32:54.2026747Z contiguous=False, 2025-05-07T20:32:54.2026831Z compiled=False, 2025-05-07T20:32:54.2026905Z ) 2025-05-07T20:32:54.2027123Z self = 2025-05-07T20:32:54.2027291Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2027295Z 2025-05-07T20:32:54.2027369Z @given( 2025-05-07T20:32:54.2027495Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2027638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2027752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2027945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2028059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2028137Z ) 2025-05-07T20:32:54.2028378Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2028469Z def test_silu_mul_quant( 2025-05-07T20:32:54.2028548Z self, 2025-05-07T20:32:54.2028624Z T: int, 2025-05-07T20:32:54.2028743Z D: int, 2025-05-07T20:32:54.2028849Z scale_ub: Optional[float], 2025-05-07T20:32:54.2028939Z contiguous: bool, 2025-05-07T20:32:54.2029023Z compiled: bool, 2025-05-07T20:32:54.2029103Z ) -> None: 2025-05-07T20:32:54.2029196Z torch.manual_seed(2025) 2025-05-07T20:32:54.2029266Z 2025-05-07T20:32:54.2029438Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2029514Z 2025-05-07T20:32:54.2029610Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2029775Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2029865Z x = x_sign * x_clamp 2025-05-07T20:32:54.2029951Z x0 = x[:, :D] 2025-05-07T20:32:54.2030028Z x1 = x[:, D:] 2025-05-07T20:32:54.2030100Z 2025-05-07T20:32:54.2030186Z if contiguous: 2025-05-07T20:32:54.2030278Z x0 = x0.contiguous() 2025-05-07T20:32:54.2030365Z x1 = x1.contiguous() 2025-05-07T20:32:54.2030448Z 2025-05-07T20:32:54.2030539Z if scale_ub is not None: 2025-05-07T20:32:54.2030643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2030783Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2030860Z ) 2025-05-07T20:32:54.2030936Z else: 2025-05-07T20:32:54.2031036Z scale_ub_tensor = None 2025-05-07T20:32:54.2031116Z 2025-05-07T20:32:54.2031250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2031342Z op = silu_mul_quant 2025-05-07T20:32:54.2031430Z if compiled: 2025-05-07T20:32:54.2031535Z op = torch.compile(op) 2025-05-07T20:32:54.2031639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2031712Z 2025-05-07T20:32:54.2044508Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2044517Z 2025-05-07T20:32:54.2044639Z moe/activation_test.py:117: 2025-05-07T20:32:54.2044776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2044890Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2044997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2045507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2045614Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2045975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2046210Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2046545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2046638Z kernel = self.compile( 2025-05-07T20:32:54.2047027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2047201Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2047339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2047344Z 2025-05-07T20:32:54.2047551Z self = 2025-05-07T20:32:54.2048319Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2048959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88b181440>} 2025-05-07T20:32:54.2049697Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2049937Z context = 2025-05-07T20:32:54.2049942Z 2025-05-07T20:32:54.2050106Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2050370Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2050485Z module_map=module_map) 2025-05-07T20:32:54.2050655Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2050762Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2050882Z E ^ 2025-05-07T20:32:54.2051238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2051243Z 2025-05-07T20:32:54.2051662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2051667Z 2025-05-07T20:32:54.2051770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2052000Z self=, 2025-05-07T20:32:54.2052079Z T=128, 2025-05-07T20:32:54.2052157Z D=7168, 2025-05-07T20:32:54.2052247Z scale_ub=None, 2025-05-07T20:32:54.2052336Z contiguous=False, 2025-05-07T20:32:54.2052419Z compiled=True, 2025-05-07T20:32:54.2052506Z ) 2025-05-07T20:32:54.2052725Z self = 2025-05-07T20:32:54.2052900Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2052908Z 2025-05-07T20:32:54.2053099Z @given( 2025-05-07T20:32:54.2053222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2053332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2053446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2053563Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2053685Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2053767Z ) 2025-05-07T20:32:54.2054013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2054112Z def test_silu_mul_quant( 2025-05-07T20:32:54.2054191Z self, 2025-05-07T20:32:54.2054270Z T: int, 2025-05-07T20:32:54.2054355Z D: int, 2025-05-07T20:32:54.2054454Z scale_ub: Optional[float], 2025-05-07T20:32:54.2054548Z contiguous: bool, 2025-05-07T20:32:54.2054640Z compiled: bool, 2025-05-07T20:32:54.2054723Z ) -> None: 2025-05-07T20:32:54.2054826Z torch.manual_seed(2025) 2025-05-07T20:32:54.2054902Z 2025-05-07T20:32:54.2055073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2055155Z 2025-05-07T20:32:54.2055245Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2055368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2055464Z x = x_sign * x_clamp 2025-05-07T20:32:54.2055549Z x0 = x[:, :D] 2025-05-07T20:32:54.2055628Z x1 = x[:, D:] 2025-05-07T20:32:54.2055708Z 2025-05-07T20:32:54.2055791Z if contiguous: 2025-05-07T20:32:54.2055883Z x0 = x0.contiguous() 2025-05-07T20:32:54.2055984Z x1 = x1.contiguous() 2025-05-07T20:32:54.2056058Z 2025-05-07T20:32:54.2056150Z if scale_ub is not None: 2025-05-07T20:32:54.2056312Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2056446Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2056570Z ) 2025-05-07T20:32:54.2056648Z else: 2025-05-07T20:32:54.2056739Z scale_ub_tensor = None 2025-05-07T20:32:54.2056819Z 2025-05-07T20:32:54.2056947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2057035Z op = silu_mul_quant 2025-05-07T20:32:54.2057124Z if compiled: 2025-05-07T20:32:54.2057225Z op = torch.compile(op) 2025-05-07T20:32:54.2057371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2057451Z 2025-05-07T20:32:54.2057541Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2057661Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2057739Z 2025-05-07T20:32:54.2057874Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2057980Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2058083Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2058207Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2058406Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2058481Z 2025-05-07T20:32:54.2058584Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.2058589Z 2025-05-07T20:32:54.2058696Z moe/activation_test.py:126: 2025-05-07T20:32:54.2058823Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2058938Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.2059074Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2059988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.2060107Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.2060471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2060699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2061069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.2061324Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.2061703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.2061872Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.2062210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.2062298Z fn() 2025-05-07T20:32:54.2062695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.2062792Z self.fn.run( 2025-05-07T20:32:54.2063140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2063234Z kernel = self.compile( 2025-05-07T20:32:54.2063618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2063793Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2063921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2063929Z 2025-05-07T20:32:54.2064145Z self = 2025-05-07T20:32:54.2064918Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2065658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af64540>} 2025-05-07T20:32:54.2066396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2066596Z context = 2025-05-07T20:32:54.2066601Z 2025-05-07T20:32:54.2066766Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2067093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2067211Z module_map=module_map) 2025-05-07T20:32:54.2067375Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2067481Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.2067571Z E ^ 2025-05-07T20:32:54.2067986Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2067991Z 2025-05-07T20:32:54.2068408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2068412Z 2025-05-07T20:32:54.2068513Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2068735Z self=, 2025-05-07T20:32:54.2068824Z T=128, 2025-05-07T20:32:54.2068904Z D=7168, 2025-05-07T20:32:54.2068987Z scale_ub=None, 2025-05-07T20:32:54.2069085Z contiguous=False, 2025-05-07T20:32:54.2069169Z compiled=False, 2025-05-07T20:32:54.2069253Z ) 2025-05-07T20:32:54.2069469Z self = 2025-05-07T20:32:54.2069638Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2069645Z 2025-05-07T20:32:54.2069729Z @given( 2025-05-07T20:32:54.2069850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2069955Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2070079Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2070195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2070307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2070391Z ) 2025-05-07T20:32:54.2070633Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2070736Z def test_silu_mul_quant( 2025-05-07T20:32:54.2070815Z self, 2025-05-07T20:32:54.2070896Z T: int, 2025-05-07T20:32:54.2070978Z D: int, 2025-05-07T20:32:54.2071076Z scale_ub: Optional[float], 2025-05-07T20:32:54.2071165Z contiguous: bool, 2025-05-07T20:32:54.2071256Z compiled: bool, 2025-05-07T20:32:54.2071338Z ) -> None: 2025-05-07T20:32:54.2071430Z torch.manual_seed(2025) 2025-05-07T20:32:54.2071508Z 2025-05-07T20:32:54.2071682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2071758Z 2025-05-07T20:32:54.2071858Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2071982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2072077Z x = x_sign * x_clamp 2025-05-07T20:32:54.2072160Z x0 = x[:, :D] 2025-05-07T20:32:54.2072239Z x1 = x[:, D:] 2025-05-07T20:32:54.2072315Z 2025-05-07T20:32:54.2072401Z if contiguous: 2025-05-07T20:32:54.2072492Z x0 = x0.contiguous() 2025-05-07T20:32:54.2072587Z x1 = x1.contiguous() 2025-05-07T20:32:54.2072664Z 2025-05-07T20:32:54.2072755Z if scale_ub is not None: 2025-05-07T20:32:54.2072869Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2073002Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2073153Z ) 2025-05-07T20:32:54.2073237Z else: 2025-05-07T20:32:54.2073331Z scale_ub_tensor = None 2025-05-07T20:32:54.2073447Z 2025-05-07T20:32:54.2073590Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2073680Z op = silu_mul_quant 2025-05-07T20:32:54.2073770Z if compiled: 2025-05-07T20:32:54.2073872Z op = torch.compile(op) 2025-05-07T20:32:54.2073981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2074057Z 2025-05-07T20:32:54.2074149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2074194Z 2025-05-07T20:32:54.2074297Z moe/activation_test.py:117: 2025-05-07T20:32:54.2074436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2074540Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2074641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2075142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2075244Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2075645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2075869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2076203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2076304Z kernel = self.compile( 2025-05-07T20:32:54.2076685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2076872Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2076999Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2077004Z 2025-05-07T20:32:54.2077212Z self = 2025-05-07T20:32:54.2077997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2078497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af66700>} 2025-05-07T20:32:54.2079237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2079437Z context = 2025-05-07T20:32:54.2079442Z 2025-05-07T20:32:54.2079604Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2079874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2079988Z module_map=module_map) 2025-05-07T20:32:54.2080159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2080261Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2080341Z E ^ 2025-05-07T20:32:54.2080700Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2080705Z 2025-05-07T20:32:54.2081113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2081121Z 2025-05-07T20:32:54.2081228Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2081453Z self=, 2025-05-07T20:32:54.2081533Z T=4096, 2025-05-07T20:32:54.2081620Z D=5120, 2025-05-07T20:32:54.2081753Z scale_ub=1200.0, 2025-05-07T20:32:54.2081839Z contiguous=True, 2025-05-07T20:32:54.2081930Z compiled=False, 2025-05-07T20:32:54.2082008Z ) 2025-05-07T20:32:54.2082269Z self = 2025-05-07T20:32:54.2082455Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2082460Z 2025-05-07T20:32:54.2082537Z @given( 2025-05-07T20:32:54.2082672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2082795Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2082981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2083107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2083221Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2083301Z ) 2025-05-07T20:32:54.2083551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2083643Z def test_silu_mul_quant( 2025-05-07T20:32:54.2083723Z self, 2025-05-07T20:32:54.2083809Z T: int, 2025-05-07T20:32:54.2083886Z D: int, 2025-05-07T20:32:54.2084039Z scale_ub: Optional[float], 2025-05-07T20:32:54.2084140Z contiguous: bool, 2025-05-07T20:32:54.2084227Z compiled: bool, 2025-05-07T20:32:54.2084311Z ) -> None: 2025-05-07T20:32:54.2084405Z torch.manual_seed(2025) 2025-05-07T20:32:54.2084482Z 2025-05-07T20:32:54.2084660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2084733Z 2025-05-07T20:32:54.2084827Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2084959Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2085050Z x = x_sign * x_clamp 2025-05-07T20:32:54.2085132Z x0 = x[:, :D] 2025-05-07T20:32:54.2085219Z x1 = x[:, D:] 2025-05-07T20:32:54.2085293Z 2025-05-07T20:32:54.2085379Z if contiguous: 2025-05-07T20:32:54.2085479Z x0 = x0.contiguous() 2025-05-07T20:32:54.2085569Z x1 = x1.contiguous() 2025-05-07T20:32:54.2085652Z 2025-05-07T20:32:54.2085745Z if scale_ub is not None: 2025-05-07T20:32:54.2085855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2085993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2086070Z ) 2025-05-07T20:32:54.2086149Z else: 2025-05-07T20:32:54.2086247Z scale_ub_tensor = None 2025-05-07T20:32:54.2086321Z 2025-05-07T20:32:54.2086449Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2086550Z op = silu_mul_quant 2025-05-07T20:32:54.2086632Z if compiled: 2025-05-07T20:32:54.2086731Z op = torch.compile(op) 2025-05-07T20:32:54.2086840Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2086914Z 2025-05-07T20:32:54.2087011Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2087016Z 2025-05-07T20:32:54.2087113Z moe/activation_test.py:117: 2025-05-07T20:32:54.2087240Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2087350Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2087451Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2087944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2088052Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2088404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2088632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2088968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2089065Z kernel = self.compile( 2025-05-07T20:32:54.2089451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2089714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2089846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2089858Z 2025-05-07T20:32:54.2090063Z self = 2025-05-07T20:32:54.2090829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2091374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88af676a0>} 2025-05-07T20:32:54.2092108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2092345Z context = 2025-05-07T20:32:54.2092350Z 2025-05-07T20:32:54.2092516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2092777Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2092892Z module_map=module_map) 2025-05-07T20:32:54.2093162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2093281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2093370Z E ^ 2025-05-07T20:32:54.2093723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2093728Z 2025-05-07T20:32:54.2094145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2094152Z 2025-05-07T20:32:54.2094255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2094482Z self=, 2025-05-07T20:32:54.2094573Z T=1, 2025-05-07T20:32:54.2094652Z D=5120, 2025-05-07T20:32:54.2094741Z scale_ub=None, 2025-05-07T20:32:54.2094826Z contiguous=True, 2025-05-07T20:32:54.2094910Z compiled=True, 2025-05-07T20:32:54.2094994Z ) 2025-05-07T20:32:54.2095212Z self = 2025-05-07T20:32:54.2095376Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2095380Z 2025-05-07T20:32:54.2095466Z @given( 2025-05-07T20:32:54.2095585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2095685Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2095808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2095931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2096057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2096133Z ) 2025-05-07T20:32:54.2096385Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2096488Z def test_silu_mul_quant( 2025-05-07T20:32:54.2096567Z self, 2025-05-07T20:32:54.2096647Z T: int, 2025-05-07T20:32:54.2096732Z D: int, 2025-05-07T20:32:54.2096831Z scale_ub: Optional[float], 2025-05-07T20:32:54.2096921Z contiguous: bool, 2025-05-07T20:32:54.2097019Z compiled: bool, 2025-05-07T20:32:54.2097096Z ) -> None: 2025-05-07T20:32:54.2097204Z torch.manual_seed(2025) 2025-05-07T20:32:54.2097278Z 2025-05-07T20:32:54.2097445Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2097521Z 2025-05-07T20:32:54.2097612Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2097786Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2097887Z x = x_sign * x_clamp 2025-05-07T20:32:54.2097969Z x0 = x[:, :D] 2025-05-07T20:32:54.2098091Z x1 = x[:, D:] 2025-05-07T20:32:54.2098174Z 2025-05-07T20:32:54.2098255Z if contiguous: 2025-05-07T20:32:54.2098351Z x0 = x0.contiguous() 2025-05-07T20:32:54.2098440Z x1 = x1.contiguous() 2025-05-07T20:32:54.2098511Z 2025-05-07T20:32:54.2098609Z if scale_ub is not None: 2025-05-07T20:32:54.2098715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2098890Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2098973Z ) 2025-05-07T20:32:54.2099049Z else: 2025-05-07T20:32:54.2099140Z scale_ub_tensor = None 2025-05-07T20:32:54.2099218Z 2025-05-07T20:32:54.2099345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2099434Z op = silu_mul_quant 2025-05-07T20:32:54.2099527Z if compiled: 2025-05-07T20:32:54.2099625Z op = torch.compile(op) 2025-05-07T20:32:54.2099740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2099879Z 2025-05-07T20:32:54.2099970Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2100096Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2100169Z 2025-05-07T20:32:54.2100304Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2100413Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2100512Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2100636Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2100778Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2100849Z 2025-05-07T20:32:54.2100952Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.2100957Z 2025-05-07T20:32:54.2101053Z moe/activation_test.py:126: 2025-05-07T20:32:54.2101181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2101289Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.2101432Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2101980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.2102086Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.2102437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2102666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2103052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.2103327Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.2103700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.2103866Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.2104205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.2104281Z fn() 2025-05-07T20:32:54.2104676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.2104763Z self.fn.run( 2025-05-07T20:32:54.2105098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2105190Z kernel = self.compile( 2025-05-07T20:32:54.2105569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2105741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2105923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2105927Z 2025-05-07T20:32:54.2106172Z self = 2025-05-07T20:32:54.2106939Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2107440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88bd43880>} 2025-05-07T20:32:54.2108209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2108417Z context = 2025-05-07T20:32:54.2108424Z 2025-05-07T20:32:54.2108588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2108881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2108995Z module_map=module_map) 2025-05-07T20:32:54.2109155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2109256Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.2109338Z E ^ 2025-05-07T20:32:54.2109686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2109693Z 2025-05-07T20:32:54.2110106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2110111Z 2025-05-07T20:32:54.2110212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2110435Z self=, 2025-05-07T20:32:54.2110517Z T=2048, 2025-05-07T20:32:54.2110593Z D=5120, 2025-05-07T20:32:54.2110679Z scale_ub=None, 2025-05-07T20:32:54.2110771Z contiguous=True, 2025-05-07T20:32:54.2110853Z compiled=True, 2025-05-07T20:32:54.2110934Z ) 2025-05-07T20:32:54.2111149Z self = 2025-05-07T20:32:54.2111315Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2111319Z 2025-05-07T20:32:54.2111406Z @given( 2025-05-07T20:32:54.2111522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2111622Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2111744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2111857Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2111971Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2112058Z ) 2025-05-07T20:32:54.2112300Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2112406Z def test_silu_mul_quant( 2025-05-07T20:32:54.2112483Z self, 2025-05-07T20:32:54.2113156Z T: int, 2025-05-07T20:32:54.2113239Z D: int, 2025-05-07T20:32:54.2113339Z scale_ub: Optional[float], 2025-05-07T20:32:54.2113426Z contiguous: bool, 2025-05-07T20:32:54.2113515Z compiled: bool, 2025-05-07T20:32:54.2113592Z ) -> None: 2025-05-07T20:32:54.2113685Z torch.manual_seed(2025) 2025-05-07T20:32:54.2113765Z 2025-05-07T20:32:54.2113932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2114001Z 2025-05-07T20:32:54.2114097Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2114220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2114315Z x = x_sign * x_clamp 2025-05-07T20:32:54.2114393Z x0 = x[:, :D] 2025-05-07T20:32:54.2114523Z x1 = x[:, D:] 2025-05-07T20:32:54.2114601Z 2025-05-07T20:32:54.2114682Z if contiguous: 2025-05-07T20:32:54.2114816Z x0 = x0.contiguous() 2025-05-07T20:32:54.2114916Z x1 = x1.contiguous() 2025-05-07T20:32:54.2114988Z 2025-05-07T20:32:54.2115078Z if scale_ub is not None: 2025-05-07T20:32:54.2115190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2115322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2115393Z ) 2025-05-07T20:32:54.2115474Z else: 2025-05-07T20:32:54.2115608Z scale_ub_tensor = None 2025-05-07T20:32:54.2115682Z 2025-05-07T20:32:54.2115810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2115900Z op = silu_mul_quant 2025-05-07T20:32:54.2115990Z if compiled: 2025-05-07T20:32:54.2116089Z op = torch.compile(op) 2025-05-07T20:32:54.2116194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2116278Z 2025-05-07T20:32:54.2116367Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2116528Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2116604Z 2025-05-07T20:32:54.2116739Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2116840Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2116944Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2117066Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2117213Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2117289Z 2025-05-07T20:32:54.2117389Z > y_fp8_ref, 
        y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function …>}
module_map = {'triton.language.extra.libdevice': <module …>}
context = <…>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<…>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <…>
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: identical ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row traceback as above, ending in the same CompilationError at triton/compiler/compiler.py:100.
Trying example: test_silu_mul_quant(self=<…>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
(same source listing; fails at moe/activation_test.py:126 with the identical _kernel_quantize_fp8_row CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
(same source listing; fails at moe/activation_test.py:126 with the identical _kernel_quantize_fp8_row CompilationError)
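The path that keeps failing here is row-wise fp8 quantization. Below is a minimal pure-PyTorch sketch of that computation, consistent with the reconstruction the test itself uses (y = y_fp8.to(torch.float32) * y_scale[:, None]); the function name and the exact scale_ub/clamp handling are illustrative assumptions, not FBGEMM's actual kernel.

# Hedged sketch of row-wise fp8 quantization; not FBGEMM's implementation.
from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Largest finite value representable in fp8 e4m3 (Triton's "fp8e4nv").
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    # One scale per row, optionally capped by scale_ub as in the test above.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = (row_max / fp8_max).clamp(min=1e-12)  # per-row dequant scale
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Multiplying y_fp8 back by y_scale[:, None] recovers y up to fp8 rounding, which is exactly the round-trip the test checks.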
Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
self = <…>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

(same source listing; this time the failure is raised from fn() rather than ref_fn())

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <…>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
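Both kernels (_kernel_quantize_fp8_row and _fbgemm_silu_mul_quant) fail for the same reason: Triton's fp8e4nv is the e4m3 fp8 format, which its NVIDIA backend only emits on compute capability 8.9 or newer (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G at SM 8.6, where only 'fp8e4b15' and 'fp8e5' are available. A hedged sketch of a capability guard (not present in the test above) that would skip instead of fail on such runners:

# Sketch of an architecture guard; the SM 8.9 threshold is an assumption
# based on the supported-dtype list in the error, not taken from the log.
import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) codegen needs compute capability >= 8.9.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage (illustrative):
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...): ...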
Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
(same source listing; fails at moe/activation_test.py:126 with the identical _kernel_quantize_fp8_row CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
(same source listing; fails at moe/activation_test.py:117 with the identical _fbgemm_silu_mul_quant CompilationError, without the torch/_dynamo/eval_frame.py frame since compiled=False)
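Hypothesis keeps drawing new parameter sets and rediscovering the same failure. For local debugging it can help to pin one failing example so it replays deterministically before any random draws; @example is standard Hypothesis API, though this decorator is not part of the test shown above.

# Sketch: pinning a failing Hypothesis example for deterministic replay.
from hypothesis import example, given, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=1)  # replays the T=1 failure from this log first
def test_replay(T: int) -> None:
    assert T >= 1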
Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
(same source listing; fails at moe/activation_test.py:117 with the identical _fbgemm_silu_mul_quant CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
(same; identical _fbgemm_silu_mul_quant CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
(same; identical _fbgemm_silu_mul_quant CompilationError)

Trying example: test_silu_mul_quant(self=<…>, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
(same; identical _fbgemm_silu_mul_quant CompilationError)
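The failure is independent of T, D, scale_ub, contiguity, and torch.compile, so a tiny input reproduces it without Hypothesis. A hedged repro sketch; the module import path is inferred from the site-packages file path in the traceback, and is an assumption rather than documented API.

# Minimal repro sketch (assumes the same fbgemm_gpu experimental build).
import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

y = torch.randn(2, 8, device="cuda", dtype=torch.float32)
# On a pre-SM89 GPU this raises the same CompilationError as every example above.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)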
2025-05-07T20:32:54.2280032Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

(test body identical to the example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

(make_ir frame and locals as above)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
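The compiled=True variant changes nothing except the extra torch._dynamo eval_frame frame: the Triton kernel is still compiled lazily at first launch and fails the same way. A standalone repro sketch follows; it assumes a recent Triton that exposes tl.float8e4nv and accepts torch.float8_e4m3fn tensors as kernel arguments, and is not code from FBGEMM or this test suite.

    # Hypothetical minimal repro: materializing a tl.float8e4nv value makes
    # Triton raise the same ValueError at compile time on GPUs whose
    # architecture lacks fp8e4nv; on newer GPUs the kernel runs normally.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    N = 1024
    x = torch.randn(N, device="cuda", dtype=torch.float32)
    y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
    # Raises triton.compiler.errors.CompilationError (wrapping the ValueError)
    # on pre-fp8 architectures.
    cast_to_fp8e4nv[(triton.cdiv(N, 256),)](x, y, N, BLOCK=256)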
2025-05-07T20:32:54.2293496Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
        -> same compiled-path CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.2311694Z Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

(test body identical to the example above; this time fn() itself succeeds and
the failure moves into the reference path)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

(make_ir frame and locals as above; options here use num_stages=2)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
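Since the reference path also dies inside a Triton kernel (_kernel_quantize_fp8_row, reached through the autotuner's benchmarking loop), even the oracle cannot run on this GPU. A possible workaround is to compute the reference row-wise quantization in eager PyTorch, which casts to torch.float8_e4m3fn in software and needs no fp8 hardware support. The sketch below is an assumption about triton_quantize_fp8_row's semantics (max-based row scales, optional scale upper bound, per-row dequantization scale returned), not FBGEMM's actual implementation.

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Row-wise fp8 quantization in eager PyTorch (sketch)."""
        row_max = x.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Avoid division by zero for all-zero rows.
        row_max = torch.clamp(row_max, min=1e-12)
        x_fp8 = (x.to(torch.float32) * (FP8_MAX / row_max)).to(torch.float8_e4m3fn)
        # Return the dequantization scale, matching the test's
        # y_fp8.to(torch.float32) * y_scale[:, None] convention.
        return x_fp8, (row_max / FP8_MAX).squeeze(-1)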
2025-05-07T20:32:54.2327658Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2340996Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2354397Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2367898Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2381163Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2393785Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2406237Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
2025-05-07T20:32:54.2419241Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError from _fbgemm_silu_mul_quant
(each of the above fails with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised while compiling _fbgemm_silu_mul_quant)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2434990Z 2025-05-07T20:32:54.2435394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2435444Z 2025-05-07T20:32:54.2435586Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2435812Z self=, 2025-05-07T20:32:54.2435885Z T=4096, 2025-05-07T20:32:54.2435963Z D=5120, 2025-05-07T20:32:54.2436049Z scale_ub=1200.0, 2025-05-07T20:32:54.2436130Z contiguous=False, 2025-05-07T20:32:54.2436211Z compiled=False, 2025-05-07T20:32:54.2436288Z ) 2025-05-07T20:32:54.2436500Z self = 2025-05-07T20:32:54.2436716Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2436721Z 2025-05-07T20:32:54.2436798Z @given( 2025-05-07T20:32:54.2436915Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2437017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2437132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2437246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2437402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2437476Z ) 2025-05-07T20:32:54.2437716Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2437813Z def test_silu_mul_quant( 2025-05-07T20:32:54.2437886Z self, 2025-05-07T20:32:54.2437963Z T: int, 2025-05-07T20:32:54.2438035Z D: int, 2025-05-07T20:32:54.2438130Z scale_ub: Optional[float], 2025-05-07T20:32:54.2438225Z contiguous: bool, 2025-05-07T20:32:54.2438310Z compiled: bool, 2025-05-07T20:32:54.2438383Z ) -> None: 2025-05-07T20:32:54.2438480Z torch.manual_seed(2025) 2025-05-07T20:32:54.2438548Z 2025-05-07T20:32:54.2438713Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2438789Z 2025-05-07T20:32:54.2438879Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2439000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2439091Z x = x_sign * x_clamp 2025-05-07T20:32:54.2439172Z x0 = x[:, :D] 2025-05-07T20:32:54.2439252Z x1 = x[:, D:] 2025-05-07T20:32:54.2439322Z 2025-05-07T20:32:54.2439401Z if contiguous: 2025-05-07T20:32:54.2439492Z x0 = x0.contiguous() 2025-05-07T20:32:54.2439576Z x1 = x1.contiguous() 2025-05-07T20:32:54.2439647Z 2025-05-07T20:32:54.2439738Z if scale_ub is not None: 2025-05-07T20:32:54.2439844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2439973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2440047Z ) 2025-05-07T20:32:54.2440123Z else: 2025-05-07T20:32:54.2440216Z scale_ub_tensor = None 2025-05-07T20:32:54.2440294Z 2025-05-07T20:32:54.2440422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2440513Z op = silu_mul_quant 2025-05-07T20:32:54.2440599Z if compiled: 2025-05-07T20:32:54.2440697Z op = torch.compile(op) 2025-05-07T20:32:54.2440808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2440878Z 2025-05-07T20:32:54.2440966Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2440970Z 2025-05-07T20:32:54.2441070Z moe/activation_test.py:117: 2025-05-07T20:32:54.2441195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2441294Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2441395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2441886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.2441985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2442336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2442605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2442980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2443073Z kernel = self.compile( 2025-05-07T20:32:54.2443450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2443628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2443751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2443794Z 2025-05-07T20:32:54.2444000Z self = 2025-05-07T20:32:54.2444757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2445294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f6160>} 2025-05-07T20:32:54.2446031Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2446221Z context = 2025-05-07T20:32:54.2446228Z 2025-05-07T20:32:54.2446395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2446650Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2446759Z module_map=module_map) 2025-05-07T20:32:54.2446919Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2447019Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2447098Z E ^ 2025-05-07T20:32:54.2447448Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2447453Z 2025-05-07T20:32:54.2447856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2447861Z 2025-05-07T20:32:54.2447963Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2448179Z self=, 2025-05-07T20:32:54.2448260Z T=4096, 2025-05-07T20:32:54.2448334Z D=5120, 2025-05-07T20:32:54.2448415Z scale_ub=1200.0, 2025-05-07T20:32:54.2448500Z contiguous=False, 2025-05-07T20:32:54.2448582Z compiled=True, 2025-05-07T20:32:54.2448655Z ) 2025-05-07T20:32:54.2448873Z self = 2025-05-07T20:32:54.2449044Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.2449049Z 2025-05-07T20:32:54.2449123Z @given( 2025-05-07T20:32:54.2449249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2449345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2449461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2449576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2449689Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2449761Z ) 2025-05-07T20:32:54.2450004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2450092Z def test_silu_mul_quant( 2025-05-07T20:32:54.2450170Z self, 2025-05-07T20:32:54.2450247Z T: int, 2025-05-07T20:32:54.2450322Z D: int, 2025-05-07T20:32:54.2450423Z scale_ub: Optional[float], 2025-05-07T20:32:54.2450509Z contiguous: bool, 2025-05-07T20:32:54.2450634Z compiled: bool, 2025-05-07T20:32:54.2450715Z ) -> None: 2025-05-07T20:32:54.2450805Z torch.manual_seed(2025) 2025-05-07T20:32:54.2450942Z 2025-05-07T20:32:54.2451111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2451182Z 2025-05-07T20:32:54.2451273Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2451395Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2451478Z x = x_sign * x_clamp 2025-05-07T20:32:54.2451557Z x0 = x[:, :D] 2025-05-07T20:32:54.2451635Z x1 = x[:, D:] 2025-05-07T20:32:54.2451745Z 2025-05-07T20:32:54.2451828Z if contiguous: 2025-05-07T20:32:54.2451917Z x0 = x0.contiguous() 2025-05-07T20:32:54.2452001Z x1 = x1.contiguous() 2025-05-07T20:32:54.2452074Z 2025-05-07T20:32:54.2452161Z if scale_ub is not None: 2025-05-07T20:32:54.2452267Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2452402Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2452475Z ) 2025-05-07T20:32:54.2452554Z else: 2025-05-07T20:32:54.2452690Z scale_ub_tensor = None 2025-05-07T20:32:54.2452765Z 2025-05-07T20:32:54.2452899Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2453071Z op = silu_mul_quant 2025-05-07T20:32:54.2453171Z if compiled: 2025-05-07T20:32:54.2453271Z op = torch.compile(op) 2025-05-07T20:32:54.2453374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2453447Z 2025-05-07T20:32:54.2453540Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2453544Z 2025-05-07T20:32:54.2453638Z moe/activation_test.py:117: 2025-05-07T20:32:54.2453768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2453866Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2453963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2454327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2454423Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2454905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2455002Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2455348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2455569Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2455902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2455994Z kernel = self.compile( 2025-05-07T20:32:54.2456373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2456546Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2456674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2456684Z 2025-05-07T20:32:54.2456888Z self = 2025-05-07T20:32:54.2457646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2458148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a3f7240>} 2025-05-07T20:32:54.2458876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2459113Z context = 2025-05-07T20:32:54.2459117Z 2025-05-07T20:32:54.2459713Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2459998Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2460111Z module_map=module_map) 2025-05-07T20:32:54.2460272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2460378Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2460527Z E ^ 2025-05-07T20:32:54.2460873Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2460877Z 2025-05-07T20:32:54.2461284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2461292Z 2025-05-07T20:32:54.2461391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2461610Z self=, 2025-05-07T20:32:54.2461745Z T=2048, 2025-05-07T20:32:54.2461819Z D=7168, 2025-05-07T20:32:54.2461902Z scale_ub=1200.0, 2025-05-07T20:32:54.2461984Z contiguous=False, 2025-05-07T20:32:54.2462065Z compiled=False, 2025-05-07T20:32:54.2462143Z ) 2025-05-07T20:32:54.2462354Z self = 2025-05-07T20:32:54.2462526Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2462534Z 2025-05-07T20:32:54.2462612Z @given( 2025-05-07T20:32:54.2462728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2462825Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2462941Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2463056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2463198Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2463277Z ) 2025-05-07T20:32:54.2463539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2463633Z def test_silu_mul_quant( 2025-05-07T20:32:54.2463705Z self, 2025-05-07T20:32:54.2463781Z T: int, 2025-05-07T20:32:54.2463860Z D: int, 2025-05-07T20:32:54.2463954Z scale_ub: Optional[float], 2025-05-07T20:32:54.2464042Z contiguous: bool, 2025-05-07T20:32:54.2464127Z compiled: bool, 2025-05-07T20:32:54.2464205Z ) -> None: 2025-05-07T20:32:54.2464298Z torch.manual_seed(2025) 2025-05-07T20:32:54.2464376Z 2025-05-07T20:32:54.2464541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2464618Z 2025-05-07T20:32:54.2464709Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2464830Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2464920Z x = x_sign * x_clamp 2025-05-07T20:32:54.2464997Z x0 = x[:, :D] 2025-05-07T20:32:54.2465073Z x1 = x[:, D:] 2025-05-07T20:32:54.2465151Z 2025-05-07T20:32:54.2465237Z if contiguous: 2025-05-07T20:32:54.2465322Z x0 = x0.contiguous() 2025-05-07T20:32:54.2465411Z x1 = x1.contiguous() 2025-05-07T20:32:54.2465483Z 2025-05-07T20:32:54.2465570Z if scale_ub is not None: 2025-05-07T20:32:54.2465675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2465806Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2465888Z ) 2025-05-07T20:32:54.2465963Z else: 2025-05-07T20:32:54.2466054Z scale_ub_tensor = None 2025-05-07T20:32:54.2466127Z 2025-05-07T20:32:54.2466250Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2466336Z op = silu_mul_quant 2025-05-07T20:32:54.2466423Z if compiled: 2025-05-07T20:32:54.2466586Z op = torch.compile(op) 2025-05-07T20:32:54.2466687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2466763Z 2025-05-07T20:32:54.2466891Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2466896Z 2025-05-07T20:32:54.2466991Z moe/activation_test.py:117: 2025-05-07T20:32:54.2467121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2467217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2467316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2467804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.2467940Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2468291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2468506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2468843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2468978Z kernel = self.compile( 2025-05-07T20:32:54.2469349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2469525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2469651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2469655Z 2025-05-07T20:32:54.2469860Z self = 2025-05-07T20:32:54.2470622Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2471116Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f230220>} 2025-05-07T20:32:54.2471861Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2472048Z context = 2025-05-07T20:32:54.2472052Z 2025-05-07T20:32:54.2472217Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2472474Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2472582Z module_map=module_map) 2025-05-07T20:32:54.2472740Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2472833Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2472908Z E ^ 2025-05-07T20:32:54.2473261Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2473266Z 2025-05-07T20:32:54.2473675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2473680Z 2025-05-07T20:32:54.2473780Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2473997Z self=, 2025-05-07T20:32:54.2474073Z T=1, 2025-05-07T20:32:54.2474149Z D=7168, 2025-05-07T20:32:54.2474232Z scale_ub=None, 2025-05-07T20:32:54.2474318Z contiguous=True, 2025-05-07T20:32:54.2474403Z compiled=False, 2025-05-07T20:32:54.2474474Z ) 2025-05-07T20:32:54.2474686Z self = 2025-05-07T20:32:54.2474851Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2474856Z 2025-05-07T20:32:54.2474980Z @given( 2025-05-07T20:32:54.2475099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2475233Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2475347Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2475462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2475574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2475648Z ) 2025-05-07T20:32:54.2475891Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2475978Z def test_silu_mul_quant( 2025-05-07T20:32:54.2476098Z self, 2025-05-07T20:32:54.2476173Z T: int, 2025-05-07T20:32:54.2476247Z D: int, 2025-05-07T20:32:54.2476344Z scale_ub: Optional[float], 2025-05-07T20:32:54.2476432Z contiguous: bool, 2025-05-07T20:32:54.2476514Z compiled: bool, 2025-05-07T20:32:54.2476590Z ) -> None: 2025-05-07T20:32:54.2476682Z torch.manual_seed(2025) 2025-05-07T20:32:54.2476751Z 2025-05-07T20:32:54.2476921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2476991Z 2025-05-07T20:32:54.2477125Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2477249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2477337Z x = x_sign * x_clamp 2025-05-07T20:32:54.2477418Z x0 = x[:, :D] 2025-05-07T20:32:54.2477495Z x1 = x[:, D:] 2025-05-07T20:32:54.2477569Z 2025-05-07T20:32:54.2477653Z if contiguous: 2025-05-07T20:32:54.2477745Z x0 = x0.contiguous() 2025-05-07T20:32:54.2477828Z x1 = x1.contiguous() 2025-05-07T20:32:54.2477899Z 2025-05-07T20:32:54.2477987Z if scale_ub is not None: 2025-05-07T20:32:54.2478094Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2478225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2478298Z ) 2025-05-07T20:32:54.2478379Z else: 2025-05-07T20:32:54.2478469Z scale_ub_tensor = None 2025-05-07T20:32:54.2478539Z 2025-05-07T20:32:54.2478673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2478759Z op = silu_mul_quant 2025-05-07T20:32:54.2478838Z if compiled: 2025-05-07T20:32:54.2478937Z op = torch.compile(op) 2025-05-07T20:32:54.2479039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2479110Z 2025-05-07T20:32:54.2479203Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2479207Z 2025-05-07T20:32:54.2479301Z moe/activation_test.py:117: 2025-05-07T20:32:54.2479431Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2479528Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2479623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2480113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2480210Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2480563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2480783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2481112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2481212Z kernel = self.compile( 2025-05-07T20:32:54.2481585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2481757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2481881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2481885Z 2025-05-07T20:32:54.2482087Z self = 2025-05-07T20:32:54.2482970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2483463Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f231120>} 2025-05-07T20:32:54.2484189Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2484418Z context = 2025-05-07T20:32:54.2484422Z 2025-05-07T20:32:54.2484583Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2484838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2484943Z module_map=module_map) 2025-05-07T20:32:54.2485137Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2485238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2485312Z E ^ 2025-05-07T20:32:54.2485659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2485663Z 2025-05-07T20:32:54.2486067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2486075Z 2025-05-07T20:32:54.2486173Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2486395Z self=, 2025-05-07T20:32:54.2486469Z T=16384, 2025-05-07T20:32:54.2486543Z D=7168, 2025-05-07T20:32:54.2486623Z scale_ub=1200.0, 2025-05-07T20:32:54.2486710Z contiguous=False, 2025-05-07T20:32:54.2486791Z compiled=True, 2025-05-07T20:32:54.2486862Z ) 2025-05-07T20:32:54.2487080Z self = 2025-05-07T20:32:54.2487256Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.2487261Z 2025-05-07T20:32:54.2487332Z @given( 2025-05-07T20:32:54.2487447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2487556Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2487667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2487784Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2487897Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2487970Z ) 2025-05-07T20:32:54.2488210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2488301Z def test_silu_mul_quant( 2025-05-07T20:32:54.2488373Z self, 2025-05-07T20:32:54.2488451Z T: int, 2025-05-07T20:32:54.2488527Z D: int, 2025-05-07T20:32:54.2488623Z scale_ub: Optional[float], 2025-05-07T20:32:54.2488722Z contiguous: bool, 2025-05-07T20:32:54.2488803Z compiled: bool, 2025-05-07T20:32:54.2488880Z ) -> None: 2025-05-07T20:32:54.2488973Z torch.manual_seed(2025) 2025-05-07T20:32:54.2489042Z 2025-05-07T20:32:54.2489205Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2489281Z 2025-05-07T20:32:54.2489368Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2489494Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2489581Z x = x_sign * x_clamp 2025-05-07T20:32:54.2489658Z x0 = x[:, :D] 2025-05-07T20:32:54.2489733Z x1 = x[:, D:] 2025-05-07T20:32:54.2489804Z 2025-05-07T20:32:54.2489883Z if contiguous: 2025-05-07T20:32:54.2489976Z x0 = x0.contiguous() 2025-05-07T20:32:54.2490108Z x1 = x1.contiguous() 2025-05-07T20:32:54.2490178Z 2025-05-07T20:32:54.2490266Z if scale_ub is not None: 2025-05-07T20:32:54.2490413Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2490545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2490618Z ) 2025-05-07T20:32:54.2490696Z else: 2025-05-07T20:32:54.2490786Z scale_ub_tensor = None 2025-05-07T20:32:54.2490857Z 2025-05-07T20:32:54.2490981Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2491065Z op = silu_mul_quant 2025-05-07T20:32:54.2491194Z if compiled: 2025-05-07T20:32:54.2491291Z op = torch.compile(op) 2025-05-07T20:32:54.2491396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2491466Z 2025-05-07T20:32:54.2491553Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2491558Z 2025-05-07T20:32:54.2491653Z moe/activation_test.py:117: 2025-05-07T20:32:54.2491780Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2491876Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2492015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2492379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2492471Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2493012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2493110Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2493461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2493678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2494007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2494102Z kernel = self.compile( 2025-05-07T20:32:54.2494477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2494648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2494769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2494774Z 2025-05-07T20:32:54.2494975Z self = 2025-05-07T20:32:54.2495736Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2496230Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f232520>} 2025-05-07T20:32:54.2496965Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2497153Z context = 2025-05-07T20:32:54.2497157Z 2025-05-07T20:32:54.2497316Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2497572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2497679Z module_map=module_map) 2025-05-07T20:32:54.2497838Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2497931Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2498005Z E ^ 2025-05-07T20:32:54.2498358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2498411Z 2025-05-07T20:32:54.2498852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2498862Z 2025-05-07T20:32:54.2498965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2499181Z self=, 2025-05-07T20:32:54.2499254Z T=1, 2025-05-07T20:32:54.2499328Z D=7168, 2025-05-07T20:32:54.2499403Z scale_ub=None, 2025-05-07T20:32:54.2499489Z contiguous=False, 2025-05-07T20:32:54.2499578Z compiled=False, 2025-05-07T20:32:54.2499689Z ) 2025-05-07T20:32:54.2499901Z self = 2025-05-07T20:32:54.2500066Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2500070Z 2025-05-07T20:32:54.2500145Z @given( 2025-05-07T20:32:54.2500264Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2500362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2500472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2500631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2500743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2500816Z ) 2025-05-07T20:32:54.2501059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2501146Z def test_silu_mul_quant( 2025-05-07T20:32:54.2501219Z self, 2025-05-07T20:32:54.2501295Z T: int, 2025-05-07T20:32:54.2501372Z D: int, 2025-05-07T20:32:54.2501465Z scale_ub: Optional[float], 2025-05-07T20:32:54.2501554Z contiguous: bool, 2025-05-07T20:32:54.2501635Z compiled: bool, 2025-05-07T20:32:54.2501715Z ) -> None: 2025-05-07T20:32:54.2501805Z torch.manual_seed(2025) 2025-05-07T20:32:54.2501875Z 2025-05-07T20:32:54.2502041Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2502116Z 2025-05-07T20:32:54.2502203Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2502336Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2502422Z x = x_sign * x_clamp 2025-05-07T20:32:54.2502497Z x0 = x[:, :D] 2025-05-07T20:32:54.2502577Z x1 = x[:, D:] 2025-05-07T20:32:54.2502647Z 2025-05-07T20:32:54.2502724Z if contiguous: 2025-05-07T20:32:54.2502815Z x0 = x0.contiguous() 2025-05-07T20:32:54.2502901Z x1 = x1.contiguous() 2025-05-07T20:32:54.2502971Z 2025-05-07T20:32:54.2503063Z if scale_ub is not None: 2025-05-07T20:32:54.2503164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2503296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2503365Z ) 2025-05-07T20:32:54.2503438Z else: 2025-05-07T20:32:54.2503528Z scale_ub_tensor = None 2025-05-07T20:32:54.2503597Z 2025-05-07T20:32:54.2503726Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2503813Z op = silu_mul_quant 2025-05-07T20:32:54.2503896Z if compiled: 2025-05-07T20:32:54.2503992Z op = torch.compile(op) 2025-05-07T20:32:54.2504094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2504163Z 2025-05-07T20:32:54.2504257Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2504261Z 2025-05-07T20:32:54.2504353Z moe/activation_test.py:117: 2025-05-07T20:32:54.2504475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2504581Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2504676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2505160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2505257Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2505655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2505916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2506248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2506339Z kernel = self.compile( 2025-05-07T20:32:54.2506713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2506882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2507045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2507055Z 2025-05-07T20:32:54.2507258Z self = 2025-05-07T20:32:54.2508014Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2508576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79f233100>} 2025-05-07T20:32:54.2509304Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2509496Z context = 2025-05-07T20:32:54.2509501Z 2025-05-07T20:32:54.2509659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2509911Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2510019Z module_map=module_map) 2025-05-07T20:32:54.2510177Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2510270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2510348Z E ^ 2025-05-07T20:32:54.2510696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2510701Z 2025-05-07T20:32:54.2511108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2511112Z 2025-05-07T20:32:54.2511212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2511430Z self=, 2025-05-07T20:32:54.2511505Z T=2048, 2025-05-07T20:32:54.2511577Z D=7168, 2025-05-07T20:32:54.2511658Z scale_ub=None, 2025-05-07T20:32:54.2511740Z contiguous=False, 2025-05-07T20:32:54.2511815Z compiled=True, 2025-05-07T20:32:54.2511884Z ) 2025-05-07T20:32:54.2512096Z self = 2025-05-07T20:32:54.2512265Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2512277Z 2025-05-07T20:32:54.2512353Z @given( 2025-05-07T20:32:54.2512467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2512563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2512675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2512787Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2512900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2512975Z ) 2025-05-07T20:32:54.2513212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2513305Z def test_silu_mul_quant( 2025-05-07T20:32:54.2513378Z self, 2025-05-07T20:32:54.2513449Z T: int, 2025-05-07T20:32:54.2513525Z D: int, 2025-05-07T20:32:54.2513620Z scale_ub: Optional[float], 2025-05-07T20:32:54.2513785Z contiguous: bool, 2025-05-07T20:32:54.2513868Z compiled: bool, 2025-05-07T20:32:54.2513942Z ) -> None: 2025-05-07T20:32:54.2514076Z torch.manual_seed(2025) 2025-05-07T20:32:54.2514147Z 2025-05-07T20:32:54.2514311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2514388Z 2025-05-07T20:32:54.2514480Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2514600Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2514688Z x = x_sign * x_clamp 2025-05-07T20:32:54.2514805Z x0 = x[:, :D] 2025-05-07T20:32:54.2514879Z x1 = x[:, D:] 2025-05-07T20:32:54.2514947Z 2025-05-07T20:32:54.2515030Z if contiguous: 2025-05-07T20:32:54.2515116Z x0 = x0.contiguous() 2025-05-07T20:32:54.2515205Z x1 = x1.contiguous() 2025-05-07T20:32:54.2515272Z 2025-05-07T20:32:54.2515359Z if scale_ub is not None: 2025-05-07T20:32:54.2515468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2515596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2515670Z ) 2025-05-07T20:32:54.2515788Z else: 2025-05-07T20:32:54.2515879Z scale_ub_tensor = None 2025-05-07T20:32:54.2515955Z 2025-05-07T20:32:54.2516082Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2516169Z op = silu_mul_quant 2025-05-07T20:32:54.2516254Z if compiled: 2025-05-07T20:32:54.2516355Z op = torch.compile(op) 2025-05-07T20:32:54.2516461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2516529Z 2025-05-07T20:32:54.2516617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2516622Z 2025-05-07T20:32:54.2516712Z moe/activation_test.py:117: 2025-05-07T20:32:54.2516838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2516936Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2517041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2517406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2517496Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2517980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2518072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2518418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2518642Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2518971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2519063Z kernel = self.compile( 2025-05-07T20:32:54.2519434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2519609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2519739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2519744Z 2025-05-07T20:32:54.2519947Z self = 2025-05-07T20:32:54.2520709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2521208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a0f8720>} 2025-05-07T20:32:54.2521935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2522211Z context = 2025-05-07T20:32:54.2522216Z 2025-05-07T20:32:54.2522378Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2522635Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2522738Z module_map=module_map) 2025-05-07T20:32:54.2522896Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2523118Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2523190Z E ^ 2025-05-07T20:32:54.2523534Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2523543Z 2025-05-07T20:32:54.2523945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2523952Z 2025-05-07T20:32:54.2524050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2524312Z self=, 2025-05-07T20:32:54.2524386Z T=4096, 2025-05-07T20:32:54.2524461Z D=7168, 2025-05-07T20:32:54.2524543Z scale_ub=None, 2025-05-07T20:32:54.2524631Z contiguous=False, 2025-05-07T20:32:54.2524710Z compiled=True, 2025-05-07T20:32:54.2524787Z ) 2025-05-07T20:32:54.2525000Z self = 2025-05-07T20:32:54.2525176Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2525180Z 2025-05-07T20:32:54.2525251Z @given( 2025-05-07T20:32:54.2525367Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2525470Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2525579Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2525693Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2525805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2525880Z ) 2025-05-07T20:32:54.2526125Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2526214Z def test_silu_mul_quant( 2025-05-07T20:32:54.2526288Z self, 2025-05-07T20:32:54.2526367Z T: int, 2025-05-07T20:32:54.2526438Z D: int, 2025-05-07T20:32:54.2526534Z scale_ub: Optional[float], 2025-05-07T20:32:54.2526623Z contiguous: bool, 2025-05-07T20:32:54.2526708Z compiled: bool, 2025-05-07T20:32:54.2526781Z ) -> None: 2025-05-07T20:32:54.2526879Z torch.manual_seed(2025) 2025-05-07T20:32:54.2526948Z 2025-05-07T20:32:54.2527112Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2527182Z 2025-05-07T20:32:54.2527270Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2527393Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2527481Z x = x_sign * x_clamp 2025-05-07T20:32:54.2527556Z x0 = x[:, :D] 2025-05-07T20:32:54.2527639Z x1 = x[:, D:] 2025-05-07T20:32:54.2527709Z 2025-05-07T20:32:54.2527788Z if contiguous: 2025-05-07T20:32:54.2527876Z x0 = x0.contiguous() 2025-05-07T20:32:54.2527959Z x1 = x1.contiguous() 2025-05-07T20:32:54.2528026Z 2025-05-07T20:32:54.2528113Z if scale_ub is not None: 2025-05-07T20:32:54.2528214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2528344Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2528421Z ) 2025-05-07T20:32:54.2528492Z else: 2025-05-07T20:32:54.2528582Z scale_ub_tensor = None 2025-05-07T20:32:54.2528654Z 2025-05-07T20:32:54.2528778Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2528865Z op = silu_mul_quant 2025-05-07T20:32:54.2528994Z if compiled: 2025-05-07T20:32:54.2529089Z op = torch.compile(op) 2025-05-07T20:32:54.2529234Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2529302Z 2025-05-07T20:32:54.2529389Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2529393Z 2025-05-07T20:32:54.2529489Z moe/activation_test.py:117: 2025-05-07T20:32:54.2529613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2529710Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2529808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2530206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2530296Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2530774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2530870Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2531219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2531478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2531808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2531900Z kernel = self.compile( 2025-05-07T20:32:54.2532270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2532450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2532574Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2532578Z 2025-05-07T20:32:54.2532779Z self = 2025-05-07T20:32:54.2533652Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2534150Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc88a0f9440>} 2025-05-07T20:32:54.2534880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2535070Z context = 2025-05-07T20:32:54.2535074Z 2025-05-07T20:32:54.2535238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2535493Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2535598Z module_map=module_map) 2025-05-07T20:32:54.2535763Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2535861Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2535935Z E ^ 2025-05-07T20:32:54.2536284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2536289Z 2025-05-07T20:32:54.2536696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2536703Z 2025-05-07T20:32:54.2536803Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2537018Z self=, 2025-05-07T20:32:54.2537092Z T=16384, 2025-05-07T20:32:54.2537165Z D=5120, 2025-05-07T20:32:54.2537241Z scale_ub=1200.0, 2025-05-07T20:32:54.2537324Z contiguous=False, 2025-05-07T20:32:54.2537404Z compiled=False, 2025-05-07T20:32:54.2537520Z ) 2025-05-07T20:32:54.2537729Z self = 2025-05-07T20:32:54.2537951Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2537956Z 2025-05-07T20:32:54.2538031Z @given( 2025-05-07T20:32:54.2538149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2538244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2538355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2538473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2538646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2538719Z ) 2025-05-07T20:32:54.2538964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2539050Z def test_silu_mul_quant( 2025-05-07T20:32:54.2539123Z self, 2025-05-07T20:32:54.2539195Z T: int, 2025-05-07T20:32:54.2539267Z D: int, 2025-05-07T20:32:54.2539364Z scale_ub: Optional[float], 2025-05-07T20:32:54.2539450Z contiguous: bool, 2025-05-07T20:32:54.2539572Z compiled: bool, 2025-05-07T20:32:54.2539650Z ) -> None: 2025-05-07T20:32:54.2539743Z torch.manual_seed(2025) 2025-05-07T20:32:54.2539812Z 2025-05-07T20:32:54.2539978Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2540046Z 2025-05-07T20:32:54.2540132Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2540254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2540343Z x = x_sign * x_clamp 2025-05-07T20:32:54.2540419Z x0 = x[:, :D] 2025-05-07T20:32:54.2540500Z x1 = x[:, D:] 2025-05-07T20:32:54.2540569Z 2025-05-07T20:32:54.2540653Z if contiguous: 2025-05-07T20:32:54.2540741Z x0 = x0.contiguous() 2025-05-07T20:32:54.2540826Z x1 = x1.contiguous() 2025-05-07T20:32:54.2540900Z 2025-05-07T20:32:54.2540990Z if scale_ub is not None: 2025-05-07T20:32:54.2541091Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2541231Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2541301Z ) 2025-05-07T20:32:54.2541372Z else: 2025-05-07T20:32:54.2541466Z scale_ub_tensor = None 2025-05-07T20:32:54.2541538Z 2025-05-07T20:32:54.2541662Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2541751Z op = silu_mul_quant 2025-05-07T20:32:54.2541830Z if compiled: 2025-05-07T20:32:54.2541931Z op = torch.compile(op) 2025-05-07T20:32:54.2542030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2542101Z 2025-05-07T20:32:54.2542191Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2542195Z 2025-05-07T20:32:54.2542288Z moe/activation_test.py:117: 2025-05-07T20:32:54.2542412Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2542518Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2542612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2543146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.2546448Z         _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fc88a0fa340>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:54.2552459Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

[frame locals identical to the previous failure, modulo object addresses]

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
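Every Hypothesis example fails before the kernel ever runs: the Triton frontend rejects the fp8e4nv (FP8 E4M3) dtype while lowering _fbgemm_silu_mul_quant. Triton only lowers fp8e4nv on GPUs with compute capability 8.9 or newer (Ada, Hopper); the A10G backing this g5 runner is SM 8.6, which is why only 'fp8e4b15' and 'fp8e5' are offered. No drawn value of T, D, scale_ub, contiguous, or compiled can change the device, so one fix is to gate the test on capability. A minimal sketch, assuming a unittest-style TestCase; the helper supports_fp8e4nv and the class name are hypothetical, not code from activation_test.py:

    # Hypothetical guard, not from the FBGEMM sources: skip FP8-e4m3 tests on
    # GPUs that predate SM 8.9, where Triton cannot lower tl.float8e4nv.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        """True when the current CUDA device can compile fp8e4nv (e4m3) kernels."""
        if not torch.cuda.is_available():
            return False
        # Ada (SM 8.9) and Hopper (SM 9.0) introduced hardware FP8 support.
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant as shown above

With such a gate the job would report a skip on this runner instead of burning every Hypothesis draw on the same compile error.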
Hypothesis then drew six more compiled=True examples. Each re-ran the identical test body and failed with the identical traceback and CompilationError:

2025-05-07T20:32:54.2565548Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:54.2578412Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:54.2591178Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2603856Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:54.2616647Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2629476Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
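That the failure is architecture-bound rather than input-bound can be confirmed without FBGEMM at all: a bare cast to tl.float8e4nv trips the same frontend check. A minimal sketch, assuming Triton 3.x on a pre-SM 8.9 CUDA device; the kernel and tensor names are illustrative:

    # Hypothetical standalone repro, not from the log: casting to fp8e4nv on a
    # pre-SM 8.9 GPU raises the same CompilationError at kernel-compile time.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        offs = tl.arange(0, 16)
        x = tl.load(x_ptr + offs)
        # The .to(tl.float8e4nv) below is what the frontend rejects on SM 8.6.
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))


    x = torch.randn(16, device="cuda", dtype=torch.float32)
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # CompilationError: "type fp8e4nv not supported..."

On an SM 8.9+ device the same launch should compile and fill y.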
The remaining draws failed identically as well; for the two compiled=False draws the torch/_dynamo/eval_frame.py frame is absent from the chain, but it still ends in the Triton frontend:

2025-05-07T20:32:54.2642153Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:54.2654649Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.2670452Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2683252Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.2696176Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Every retry ends at the same frontend check:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2708313Z 2025-05-07T20:32:54.2708717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2708722Z 2025-05-07T20:32:54.2708863Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2709079Z self=, 2025-05-07T20:32:54.2709192Z T=16384, 2025-05-07T20:32:54.2709268Z D=5120, 2025-05-07T20:32:54.2709344Z scale_ub=None, 2025-05-07T20:32:54.2709429Z contiguous=False, 2025-05-07T20:32:54.2709510Z compiled=False, 2025-05-07T20:32:54.2709581Z ) 2025-05-07T20:32:54.2709795Z self = 2025-05-07T20:32:54.2709968Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2710013Z 2025-05-07T20:32:54.2710087Z @given( 2025-05-07T20:32:54.2710203Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2710299Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2710413Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2710528Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2710639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2710712Z ) 2025-05-07T20:32:54.2710997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2711091Z def test_silu_mul_quant( 2025-05-07T20:32:54.2711166Z self, 2025-05-07T20:32:54.2711240Z T: int, 2025-05-07T20:32:54.2711317Z D: int, 2025-05-07T20:32:54.2711413Z scale_ub: Optional[float], 2025-05-07T20:32:54.2711500Z contiguous: bool, 2025-05-07T20:32:54.2711582Z compiled: bool, 2025-05-07T20:32:54.2711656Z ) -> None: 2025-05-07T20:32:54.2711751Z torch.manual_seed(2025) 2025-05-07T20:32:54.2711820Z 2025-05-07T20:32:54.2711984Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2712056Z 2025-05-07T20:32:54.2712147Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2712266Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2714059Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
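The CompilationError repeated above is Triton refusing the fp8e4nv dtype (torch.float8_e4m3fn) on this runner's GPU. The 22.07 GiB capacity in the OOM reports is consistent with an NVIDIA A10G, which is compute capability sm_86, and Triton's NVIDIA backend only emits fp8e4nv on sm_89 and newer hardware, which is why the error offers only 'fp8e4b15' and 'fp8e5'. A minimal capability guard, sketched here on the assumption that torch is importable on the worker:

    import torch

    # Sketch: detect pre-sm_89 GPUs (such as the A10G apparently on this
    # runner) before exercising fp8e4nv paths; Triton accepts fp8e4nv
    # only on sm_89 and newer.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 9):
        print(f"sm_{major}{minor}: fp8e4nv unsupported here, skip fp8 tests")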
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2714067Z 2025-05-07T20:32:54.2714184Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2714189Z 2025-05-07T20:32:54.2714290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2714505Z self=, 2025-05-07T20:32:54.2714578Z T=4096, 2025-05-07T20:32:54.2714654Z D=7168, 2025-05-07T20:32:54.2714734Z scale_ub=1200.0, 2025-05-07T20:32:54.2714817Z contiguous=True, 2025-05-07T20:32:54.2714897Z compiled=True, 2025-05-07T20:32:54.2714967Z ) 2025-05-07T20:32:54.2715183Z self = 2025-05-07T20:32:54.2715348Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2715353Z 2025-05-07T20:32:54.2715423Z @given( 2025-05-07T20:32:54.2715543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2715640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2715750Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2715865Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2715972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2716044Z ) 2025-05-07T20:32:54.2716285Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2716374Z def test_silu_mul_quant( 2025-05-07T20:32:54.2716497Z self, 2025-05-07T20:32:54.2716572Z T: int, 2025-05-07T20:32:54.2716646Z D: int, 2025-05-07T20:32:54.2716742Z scale_ub: Optional[float], 2025-05-07T20:32:54.2716867Z contiguous: bool, 2025-05-07T20:32:54.2716951Z compiled: bool, 2025-05-07T20:32:54.2717027Z ) -> None: 2025-05-07T20:32:54.2717117Z torch.manual_seed(2025) 2025-05-07T20:32:54.2717187Z 2025-05-07T20:32:54.2717353Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2717425Z 2025-05-07T20:32:54.2717516Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2717681Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2719473Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2719487Z 2025-05-07T20:32:54.2719601Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2719605Z 2025-05-07T20:32:54.2719703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2719918Z self=, 2025-05-07T20:32:54.2719996Z T=16384, 2025-05-07T20:32:54.2720069Z D=7168, 2025-05-07T20:32:54.2720147Z scale_ub=None, 2025-05-07T20:32:54.2720231Z contiguous=False, 2025-05-07T20:32:54.2720309Z compiled=False, 2025-05-07T20:32:54.2720380Z ) 2025-05-07T20:32:54.2720590Z self = 2025-05-07T20:32:54.2720764Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2720771Z 2025-05-07T20:32:54.2720845Z @given( 2025-05-07T20:32:54.2720960Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2721059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2721168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2721276Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2721386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2721459Z ) 2025-05-07T20:32:54.2721699Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2721791Z def test_silu_mul_quant( 2025-05-07T20:32:54.2721862Z self, 2025-05-07T20:32:54.2721938Z T: int, 2025-05-07T20:32:54.2722010Z D: int, 2025-05-07T20:32:54.2722104Z scale_ub: Optional[float], 2025-05-07T20:32:54.2722193Z contiguous: bool, 2025-05-07T20:32:54.2722273Z compiled: bool, 2025-05-07T20:32:54.2722349Z ) -> None: 2025-05-07T20:32:54.2722442Z torch.manual_seed(2025) 2025-05-07T20:32:54.2722509Z 2025-05-07T20:32:54.2722673Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2724425Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
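The OOM allocation sizes match the test's input tensor exactly: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes. A quick back-of-the-envelope check (hypothetical helper, not part of the test file):

    # Size of a [T, 2*D] bfloat16 tensor, in MiB.
    def input_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 2**20

    print(input_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"
    print(input_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
    print(input_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"

So the failures above are simply the largest Hypothesis examples exhausting what little headroom is left on the device.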
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2724434Z 2025-05-07T20:32:54.2724547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2724551Z 2025-05-07T20:32:54.2724653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2724939Z self=, 2025-05-07T20:32:54.2725015Z T=2048, 2025-05-07T20:32:54.2725130Z D=7168, 2025-05-07T20:32:54.2725212Z scale_ub=1200.0, 2025-05-07T20:32:54.2725295Z contiguous=True, 2025-05-07T20:32:54.2725373Z compiled=True, 2025-05-07T20:32:54.2725444Z ) 2025-05-07T20:32:54.2725656Z self = 2025-05-07T20:32:54.2725821Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2725865Z 2025-05-07T20:32:54.2725936Z @given( 2025-05-07T20:32:54.2726053Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2726149Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2726262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2726373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2726481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2726564Z ) 2025-05-07T20:32:54.2726810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2726940Z def test_silu_mul_quant( 2025-05-07T20:32:54.2727019Z self, 2025-05-07T20:32:54.2727093Z T: int, 2025-05-07T20:32:54.2727164Z D: int, 2025-05-07T20:32:54.2727262Z scale_ub: Optional[float], 2025-05-07T20:32:54.2727346Z contiguous: bool, 2025-05-07T20:32:54.2727426Z compiled: bool, 2025-05-07T20:32:54.2727503Z ) -> None: 2025-05-07T20:32:54.2727597Z torch.manual_seed(2025) 2025-05-07T20:32:54.2727663Z 2025-05-07T20:32:54.2727827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2727899Z 2025-05-07T20:32:54.2727988Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2728112Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2729852Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2729864Z 2025-05-07T20:32:54.2729976Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2729984Z 2025-05-07T20:32:54.2730082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2730303Z self=, 2025-05-07T20:32:54.2730375Z T=2048, 2025-05-07T20:32:54.2730450Z D=7168, 2025-05-07T20:32:54.2730527Z scale_ub=None, 2025-05-07T20:32:54.2730607Z contiguous=True, 2025-05-07T20:32:54.2730688Z compiled=False, 2025-05-07T20:32:54.2730763Z ) 2025-05-07T20:32:54.2730974Z self = 2025-05-07T20:32:54.2731145Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2731149Z 2025-05-07T20:32:54.2731221Z @given( 2025-05-07T20:32:54.2731334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2731431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2731538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2731652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2731763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2731834Z ) 2025-05-07T20:32:54.2732072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2732163Z def test_silu_mul_quant( 2025-05-07T20:32:54.2732232Z self, 2025-05-07T20:32:54.2732354Z T: int, 2025-05-07T20:32:54.2732428Z D: int, 2025-05-07T20:32:54.2732520Z scale_ub: Optional[float], 2025-05-07T20:32:54.2732647Z contiguous: bool, 2025-05-07T20:32:54.2732732Z compiled: bool, 2025-05-07T20:32:54.2732806Z ) -> None: 2025-05-07T20:32:54.2732900Z torch.manual_seed(2025) 2025-05-07T20:32:54.2733048Z 2025-05-07T20:32:54.2733235Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2733311Z 2025-05-07T20:32:54.2733401Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.2735143Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
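The allocator hint in the message is an environment knob that must be set before the process makes its first CUDA allocation, so exporting it in the job step or at the top of a conftest.py (an assumption about where this suite would pick it up) is one option:

    import os

    # Read lazily by PyTorch's caching allocator, so it must be in the
    # environment before the first CUDA tensor is created in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

expandable_segments mitigates fragmentation-driven OOMs, but it cannot help if earlier examples genuinely keep ~21.7 GiB allocated, as the reports below suggest.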
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2735197Z 2025-05-07T20:32:54.2735347Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.2735353Z 2025-05-07T20:32:54.2735450Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2735669Z self=, 2025-05-07T20:32:54.2735739Z T=1, 2025-05-07T20:32:54.2735811Z D=7168, 2025-05-07T20:32:54.2735889Z scale_ub=1200.0, 2025-05-07T20:32:54.2735971Z contiguous=True, 2025-05-07T20:32:54.2736056Z compiled=False, 2025-05-07T20:32:54.2736124Z ) 2025-05-07T20:32:54.2736334Z self = 2025-05-07T20:32:54.2736496Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2736500Z 2025-05-07T20:32:54.2736572Z @given( 2025-05-07T20:32:54.2736689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2736786Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2736899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2737013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2737121Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2737188Z ) 2025-05-07T20:32:54.2737432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2737520Z def test_silu_mul_quant( 2025-05-07T20:32:54.2737596Z self, 2025-05-07T20:32:54.2737675Z T: int, 2025-05-07T20:32:54.2737746Z D: int, 2025-05-07T20:32:54.2737837Z scale_ub: Optional[float], 2025-05-07T20:32:54.2737925Z contiguous: bool, 2025-05-07T20:32:54.2738005Z compiled: bool, 2025-05-07T20:32:54.2738078Z ) -> None: 2025-05-07T20:32:54.2738174Z torch.manual_seed(2025) 2025-05-07T20:32:54.2738244Z 2025-05-07T20:32:54.2738412Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2738483Z 2025-05-07T20:32:54.2738573Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2738701Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2738786Z x = x_sign * x_clamp 2025-05-07T20:32:54.2738860Z x0 = x[:, :D] 2025-05-07T20:32:54.2738938Z x1 = x[:, D:] 2025-05-07T20:32:54.2739010Z 2025-05-07T20:32:54.2739089Z if contiguous: 2025-05-07T20:32:54.2739180Z x0 = x0.contiguous() 2025-05-07T20:32:54.2739268Z x1 = x1.contiguous() 2025-05-07T20:32:54.2739336Z 2025-05-07T20:32:54.2739426Z if scale_ub is not None: 2025-05-07T20:32:54.2739527Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2739658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2739730Z ) 2025-05-07T20:32:54.2739801Z else: 2025-05-07T20:32:54.2739942Z scale_ub_tensor = None 2025-05-07T20:32:54.2740008Z 2025-05-07T20:32:54.2740133Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2740268Z op = silu_mul_quant 2025-05-07T20:32:54.2740349Z if compiled: 2025-05-07T20:32:54.2740444Z op = torch.compile(op) 2025-05-07T20:32:54.2740548Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2740615Z 2025-05-07T20:32:54.2740700Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2740704Z 2025-05-07T20:32:54.2740802Z moe/activation_test.py:117: 2025-05-07T20:32:54.2740966Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2741064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2741159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2741650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2741748Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2742108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2742363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2742700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2742789Z kernel = self.compile( 2025-05-07T20:32:54.2743163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2743336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2743459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2743463Z 2025-05-07T20:32:54.2743668Z self = 2025-05-07T20:32:54.2744431Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2744928Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca2520>} 2025-05-07T20:32:54.2745657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2745852Z context = 2025-05-07T20:32:54.2745857Z 2025-05-07T20:32:54.2746019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2746275Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2746382Z module_map=module_map) 2025-05-07T20:32:54.2746538Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2746634Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2746712Z E ^ 2025-05-07T20:32:54.2747057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2747062Z 2025-05-07T20:32:54.2747467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2747474Z 2025-05-07T20:32:54.2747572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2747787Z self=, 2025-05-07T20:32:54.2747860Z T=128, 2025-05-07T20:32:54.2747932Z D=5120, 2025-05-07T20:32:54.2748012Z scale_ub=None, 2025-05-07T20:32:54.2748097Z contiguous=True, 2025-05-07T20:32:54.2748178Z compiled=False, 2025-05-07T20:32:54.2748293Z ) 2025-05-07T20:32:54.2748509Z self = 2025-05-07T20:32:54.2748714Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2748719Z 2025-05-07T20:32:54.2748799Z @given( 2025-05-07T20:32:54.2748914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2749008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2749122Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2749234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2749383Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2749455Z ) 2025-05-07T20:32:54.2749693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2749784Z def test_silu_mul_quant( 2025-05-07T20:32:54.2749856Z self, 2025-05-07T20:32:54.2749930Z T: int, 2025-05-07T20:32:54.2750008Z D: int, 2025-05-07T20:32:54.2750101Z scale_ub: Optional[float], 2025-05-07T20:32:54.2750186Z contiguous: bool, 2025-05-07T20:32:54.2750333Z compiled: bool, 2025-05-07T20:32:54.2750410Z ) -> None: 2025-05-07T20:32:54.2750499Z torch.manual_seed(2025) 2025-05-07T20:32:54.2750571Z 2025-05-07T20:32:54.2750733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2750801Z 2025-05-07T20:32:54.2750893Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2751015Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2751108Z x = x_sign * x_clamp 2025-05-07T20:32:54.2751185Z x0 = x[:, :D] 2025-05-07T20:32:54.2751259Z x1 = x[:, D:] 2025-05-07T20:32:54.2751330Z 2025-05-07T20:32:54.2751408Z if contiguous: 2025-05-07T20:32:54.2751493Z x0 = x0.contiguous() 2025-05-07T20:32:54.2751581Z x1 = x1.contiguous() 2025-05-07T20:32:54.2751649Z 2025-05-07T20:32:54.2751736Z if scale_ub is not None: 2025-05-07T20:32:54.2751840Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2751973Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2752044Z ) 2025-05-07T20:32:54.2752121Z else: 2025-05-07T20:32:54.2752211Z scale_ub_tensor = None 2025-05-07T20:32:54.2752282Z 2025-05-07T20:32:54.2752409Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2752496Z op = silu_mul_quant 2025-05-07T20:32:54.2752580Z if compiled: 2025-05-07T20:32:54.2752679Z op = torch.compile(op) 2025-05-07T20:32:54.2752780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2752852Z 2025-05-07T20:32:54.2752950Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2752956Z 2025-05-07T20:32:54.2753057Z moe/activation_test.py:117: 2025-05-07T20:32:54.2753211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2753311Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2753407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2753903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2753997Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2754350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2754565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2754897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2754988Z kernel = self.compile( 2025-05-07T20:32:54.2755360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2755530Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2755703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2755747Z 2025-05-07T20:32:54.2755953Z self = 2025-05-07T20:32:54.2756712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2757206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79eca3420>} 2025-05-07T20:32:54.2757980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2758168Z context = 2025-05-07T20:32:54.2758172Z 2025-05-07T20:32:54.2758370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2758626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2758728Z module_map=module_map) 2025-05-07T20:32:54.2758885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2758980Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2759053Z E ^ 2025-05-07T20:32:54.2759653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2759658Z 2025-05-07T20:32:54.2760065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2760070Z 2025-05-07T20:32:54.2760169Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2760388Z self=, 2025-05-07T20:32:54.2760459Z T=128, 2025-05-07T20:32:54.2760540Z D=7168, 2025-05-07T20:32:54.2760619Z scale_ub=None, 2025-05-07T20:32:54.2760697Z contiguous=True, 2025-05-07T20:32:54.2760781Z compiled=False, 2025-05-07T20:32:54.2760848Z ) 2025-05-07T20:32:54.2761059Z self = 2025-05-07T20:32:54.2761227Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2761234Z 2025-05-07T20:32:54.2761306Z @given( 2025-05-07T20:32:54.2761428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2761525Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2761634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2761748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2761858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2761929Z ) 2025-05-07T20:32:54.2762175Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2762267Z def test_silu_mul_quant( 2025-05-07T20:32:54.2762337Z self, 2025-05-07T20:32:54.2762413Z T: int, 2025-05-07T20:32:54.2762485Z D: int, 2025-05-07T20:32:54.2762578Z scale_ub: Optional[float], 2025-05-07T20:32:54.2762667Z contiguous: bool, 2025-05-07T20:32:54.2762750Z compiled: bool, 2025-05-07T20:32:54.2762828Z ) -> None: 2025-05-07T20:32:54.2762924Z torch.manual_seed(2025) 2025-05-07T20:32:54.2762994Z 2025-05-07T20:32:54.2763161Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2763231Z 2025-05-07T20:32:54.2763324Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2763448Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2763532Z x = x_sign * x_clamp 2025-05-07T20:32:54.2763686Z x0 = x[:, :D] 2025-05-07T20:32:54.2763769Z x1 = x[:, D:] 2025-05-07T20:32:54.2763837Z 2025-05-07T20:32:54.2763973Z if contiguous: 2025-05-07T20:32:54.2764066Z x0 = x0.contiguous() 2025-05-07T20:32:54.2764152Z x1 = x1.contiguous() 2025-05-07T20:32:54.2764222Z 2025-05-07T20:32:54.2764309Z if scale_ub is not None: 2025-05-07T20:32:54.2764409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2764541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2764609Z ) 2025-05-07T20:32:54.2764742Z else: 2025-05-07T20:32:54.2764834Z scale_ub_tensor = None 2025-05-07T20:32:54.2764904Z 2025-05-07T20:32:54.2765029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2765118Z op = silu_mul_quant 2025-05-07T20:32:54.2765196Z if compiled: 2025-05-07T20:32:54.2765290Z op = torch.compile(op) 2025-05-07T20:32:54.2765396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2765465Z 2025-05-07T20:32:54.2765551Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2765562Z 2025-05-07T20:32:54.2765709Z moe/activation_test.py:117: 2025-05-07T20:32:54.2765837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2765935Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2766030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2766518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2766616Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2766965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2767184Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2767514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2767607Z kernel = self.compile( 2025-05-07T20:32:54.2767986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2768156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2768277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2768282Z 2025-05-07T20:32:54.2768484Z self = 2025-05-07T20:32:54.2769242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2769736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea944a0>} 2025-05-07T20:32:54.2770468Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2770658Z context = 2025-05-07T20:32:54.2770662Z 2025-05-07T20:32:54.2770822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2771073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2771181Z module_map=module_map) 2025-05-07T20:32:54.2771338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2771434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2771509Z E ^ 2025-05-07T20:32:54.2771854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2771903Z 2025-05-07T20:32:54.2772350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2772356Z 2025-05-07T20:32:54.2772455Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2772672Z self=, 2025-05-07T20:32:54.2772743Z T=2048, 2025-05-07T20:32:54.2772817Z D=7168, 2025-05-07T20:32:54.2772898Z scale_ub=1200.0, 2025-05-07T20:32:54.2773078Z contiguous=True, 2025-05-07T20:32:54.2773160Z compiled=False, 2025-05-07T20:32:54.2773233Z ) 2025-05-07T20:32:54.2773447Z self = 2025-05-07T20:32:54.2773617Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2773622Z 2025-05-07T20:32:54.2773698Z @given( 2025-05-07T20:32:54.2773816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2773912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2774070Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2774184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2774300Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2774368Z ) 2025-05-07T20:32:54.2774606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2774699Z def test_silu_mul_quant( 2025-05-07T20:32:54.2774776Z self, 2025-05-07T20:32:54.2774848Z T: int, 2025-05-07T20:32:54.2774927Z D: int, 2025-05-07T20:32:54.2775021Z scale_ub: Optional[float], 2025-05-07T20:32:54.2775107Z contiguous: bool, 2025-05-07T20:32:54.2775192Z compiled: bool, 2025-05-07T20:32:54.2775266Z ) -> None: 2025-05-07T20:32:54.2775357Z torch.manual_seed(2025) 2025-05-07T20:32:54.2775430Z 2025-05-07T20:32:54.2775593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2777346Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
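Note that the compiled=False examples fail with the same ValueError as the compiled=True ones: silu_mul_quant launches the Triton kernel directly, so torch.compile only adds the _dynamo eval_frame hop seen in the earlier tracebacks. Roughly, with frames abbreviated from the tracebacks above:

    # compiled=True : test fn -> torch._dynamo.eval_frame._fn -> silu_mul_quant
    #                 -> _fbgemm_silu_mul_quant[grid] -> Triton compile -> ValueError
    # compiled=False: test fn -> silu_mul_quant
    #                 -> _fbgemm_silu_mul_quant[grid] -> Triton compile -> ValueError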
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2777354Z 2025-05-07T20:32:54.2777467Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2777472Z 2025-05-07T20:32:54.2777573Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2777789Z self=, 2025-05-07T20:32:54.2777863Z T=1, 2025-05-07T20:32:54.2777940Z D=5120, 2025-05-07T20:32:54.2778020Z scale_ub=1200.0, 2025-05-07T20:32:54.2778104Z contiguous=True, 2025-05-07T20:32:54.2778187Z compiled=False, 2025-05-07T20:32:54.2778257Z ) 2025-05-07T20:32:54.2778466Z self = 2025-05-07T20:32:54.2778629Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2778634Z 2025-05-07T20:32:54.2778709Z @given( 2025-05-07T20:32:54.2778824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2778924Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2779033Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2779148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2779257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2779328Z ) 2025-05-07T20:32:54.2779568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2779751Z def test_silu_mul_quant( 2025-05-07T20:32:54.2779827Z self, 2025-05-07T20:32:54.2779944Z T: int, 2025-05-07T20:32:54.2780023Z D: int, 2025-05-07T20:32:54.2780120Z scale_ub: Optional[float], 2025-05-07T20:32:54.2780206Z contiguous: bool, 2025-05-07T20:32:54.2780288Z compiled: bool, 2025-05-07T20:32:54.2780367Z ) -> None: 2025-05-07T20:32:54.2780462Z torch.manual_seed(2025) 2025-05-07T20:32:54.2780530Z 2025-05-07T20:32:54.2780695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2780834Z 2025-05-07T20:32:54.2780923Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2781048Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2781133Z x = x_sign * x_clamp 2025-05-07T20:32:54.2781209Z x0 = x[:, :D] 2025-05-07T20:32:54.2781288Z x1 = x[:, D:] 2025-05-07T20:32:54.2781365Z 2025-05-07T20:32:54.2781445Z if contiguous: 2025-05-07T20:32:54.2781535Z x0 = x0.contiguous() 2025-05-07T20:32:54.2781624Z x1 = x1.contiguous() 2025-05-07T20:32:54.2781732Z 2025-05-07T20:32:54.2781820Z if scale_ub is not None: 2025-05-07T20:32:54.2781922Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2782054Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2782125Z ) 2025-05-07T20:32:54.2782196Z else: 2025-05-07T20:32:54.2782291Z scale_ub_tensor = None 2025-05-07T20:32:54.2782365Z 2025-05-07T20:32:54.2782491Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2782585Z op = silu_mul_quant 2025-05-07T20:32:54.2782671Z if compiled: 2025-05-07T20:32:54.2782767Z op = torch.compile(op) 2025-05-07T20:32:54.2782877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2786052Z 2025-05-07T20:32:54.2786160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2786165Z 2025-05-07T20:32:54.2786268Z moe/activation_test.py:117: 2025-05-07T20:32:54.2786405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2786508Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2786607Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2787108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2787208Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2787563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2787782Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2788124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2788215Z kernel = self.compile( 2025-05-07T20:32:54.2788596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2788769Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2788894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2788899Z 2025-05-07T20:32:54.2789107Z self = 2025-05-07T20:32:54.2789869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2790370Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79ea95a80>} 2025-05-07T20:32:54.2791166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2791400Z context = 2025-05-07T20:32:54.2791405Z 2025-05-07T20:32:54.2791566Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2791824Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2791932Z module_map=module_map) 2025-05-07T20:32:54.2792132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2792229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2792306Z E ^ 2025-05-07T20:32:54.2792651Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2792656Z 2025-05-07T20:32:54.2793063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2793067Z 2025-05-07T20:32:54.2793206Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2793428Z self=, 2025-05-07T20:32:54.2793505Z T=2048, 2025-05-07T20:32:54.2793577Z D=5120, 2025-05-07T20:32:54.2793655Z scale_ub=None, 2025-05-07T20:32:54.2793739Z contiguous=True, 2025-05-07T20:32:54.2793820Z compiled=False, 2025-05-07T20:32:54.2793889Z ) 2025-05-07T20:32:54.2794109Z self = 2025-05-07T20:32:54.2794276Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2794281Z 2025-05-07T20:32:54.2794360Z @given( 2025-05-07T20:32:54.2794477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2794573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2794693Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2794805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2794920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2794999Z ) 2025-05-07T20:32:54.2795240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2795333Z def test_silu_mul_quant( 2025-05-07T20:32:54.2795404Z self, 2025-05-07T20:32:54.2795475Z T: int, 2025-05-07T20:32:54.2795552Z D: int, 2025-05-07T20:32:54.2795650Z scale_ub: Optional[float], 2025-05-07T20:32:54.2795735Z contiguous: bool, 2025-05-07T20:32:54.2795820Z compiled: bool, 2025-05-07T20:32:54.2795899Z ) -> None: 2025-05-07T20:32:54.2795990Z torch.manual_seed(2025) 2025-05-07T20:32:54.2796060Z 2025-05-07T20:32:54.2796224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2796299Z 2025-05-07T20:32:54.2796387Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.2798145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2798154Z 2025-05-07T20:32:54.2798270Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:54.2798275Z 2025-05-07T20:32:54.2798371Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2798591Z self=, 2025-05-07T20:32:54.2798665Z T=16384, 2025-05-07T20:32:54.2798783Z D=5120, 2025-05-07T20:32:54.2798862Z scale_ub=None, 2025-05-07T20:32:54.2798944Z contiguous=True, 2025-05-07T20:32:54.2799062Z compiled=False, 2025-05-07T20:32:54.2799134Z ) 2025-05-07T20:32:54.2799344Z self = 2025-05-07T20:32:54.2799515Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2799519Z 2025-05-07T20:32:54.2799594Z @given( 2025-05-07T20:32:54.2799709Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2799850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2799958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2800070Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2800181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2800255Z ) 2025-05-07T20:32:54.2800493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2800587Z def test_silu_mul_quant( 2025-05-07T20:32:54.2800659Z self, 2025-05-07T20:32:54.2800733Z T: int, 2025-05-07T20:32:54.2800856Z D: int, 2025-05-07T20:32:54.2800952Z scale_ub: Optional[float], 2025-05-07T20:32:54.2801042Z contiguous: bool, 2025-05-07T20:32:54.2801123Z compiled: bool, 2025-05-07T20:32:54.2801194Z ) -> None: 2025-05-07T20:32:54.2801288Z torch.manual_seed(2025) 2025-05-07T20:32:54.2801355Z 2025-05-07T20:32:54.2801518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2803268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2803277Z 2025-05-07T20:32:54.2803388Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2803392Z 2025-05-07T20:32:54.2803492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2803706Z self=, 2025-05-07T20:32:54.2803779Z T=4096, 2025-05-07T20:32:54.2803856Z D=5120, 2025-05-07T20:32:54.2803935Z scale_ub=None, 2025-05-07T20:32:54.2804013Z contiguous=True, 2025-05-07T20:32:54.2804095Z compiled=False, 2025-05-07T20:32:54.2804163Z ) 2025-05-07T20:32:54.2804375Z self = 2025-05-07T20:32:54.2804540Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2804545Z 2025-05-07T20:32:54.2804621Z @given( 2025-05-07T20:32:54.2804737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2804835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2804947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2805060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2805169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2805246Z ) 2025-05-07T20:32:54.2805485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2805572Z def test_silu_mul_quant( 2025-05-07T20:32:54.2805651Z self, 2025-05-07T20:32:54.2805723Z T: int, 2025-05-07T20:32:54.2805798Z D: int, 2025-05-07T20:32:54.2805894Z scale_ub: Optional[float], 2025-05-07T20:32:54.2805978Z contiguous: bool, 2025-05-07T20:32:54.2806060Z compiled: bool, 2025-05-07T20:32:54.2806138Z ) -> None: 2025-05-07T20:32:54.2806226Z torch.manual_seed(2025) 2025-05-07T20:32:54.2806342Z 2025-05-07T20:32:54.2806506Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2808286Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2808329Z 2025-05-07T20:32:54.2808444Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2808449Z 2025-05-07T20:32:54.2808547Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2808764Z self=, 2025-05-07T20:32:54.2808838Z T=2048, 2025-05-07T20:32:54.2808909Z D=5120, 2025-05-07T20:32:54.2808990Z scale_ub=None, 2025-05-07T20:32:54.2809111Z contiguous=False, 2025-05-07T20:32:54.2809194Z compiled=False, 2025-05-07T20:32:54.2809264Z ) 2025-05-07T20:32:54.2809474Z self = 2025-05-07T20:32:54.2809640Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2809645Z 2025-05-07T20:32:54.2809724Z @given( 2025-05-07T20:32:54.2809836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2809938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2810046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2810157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2810269Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2810339Z ) 2025-05-07T20:32:54.2810581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2810673Z def test_silu_mul_quant( 2025-05-07T20:32:54.2810749Z self, 2025-05-07T20:32:54.2810825Z T: int, 2025-05-07T20:32:54.2810901Z D: int, 2025-05-07T20:32:54.2810994Z scale_ub: Optional[float], 2025-05-07T20:32:54.2811080Z contiguous: bool, 2025-05-07T20:32:54.2811162Z compiled: bool, 2025-05-07T20:32:54.2811236Z ) -> None: 2025-05-07T20:32:54.2811328Z torch.manual_seed(2025) 2025-05-07T20:32:54.2811395Z 2025-05-07T20:32:54.2811559Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2813376Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
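The OOM reports also show the pool filling up across examples: free memory falls from 140.44 MiB to 26.44 MiB while PyTorch's allocated memory climbs from 21.50 GiB to 21.73 GiB, so tensors from earlier Hypothesis examples are still resident when the next one starts. A per-example cleanup, sketched as a hypothetical teardown that the test file does not currently have:

    import gc

    import torch

    # Drop dangling Python references first, then return cached blocks to
    # the driver so the next example starts from a (mostly) empty pool.
    gc.collect()
    torch.cuda.empty_cache()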
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2813385Z 2025-05-07T20:32:54.2813495Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2813500Z 2025-05-07T20:32:54.2813600Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2813814Z self=, 2025-05-07T20:32:54.2813890Z T=4096, 2025-05-07T20:32:54.2813965Z D=7168, 2025-05-07T20:32:54.2814044Z scale_ub=None, 2025-05-07T20:32:54.2814123Z contiguous=True, 2025-05-07T20:32:54.2814203Z compiled=True, 2025-05-07T20:32:54.2814273Z ) 2025-05-07T20:32:54.2814487Z self = 2025-05-07T20:32:54.2814648Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2814701Z 2025-05-07T20:32:54.2814775Z @given( 2025-05-07T20:32:54.2814951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2815049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2815158Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2815270Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2815377Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2815453Z ) 2025-05-07T20:32:54.2815692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2815823Z def test_silu_mul_quant( 2025-05-07T20:32:54.2815900Z self, 2025-05-07T20:32:54.2815975Z T: int, 2025-05-07T20:32:54.2816048Z D: int, 2025-05-07T20:32:54.2816147Z scale_ub: Optional[float], 2025-05-07T20:32:54.2816232Z contiguous: bool, 2025-05-07T20:32:54.2816313Z compiled: bool, 2025-05-07T20:32:54.2816390Z ) -> None: 2025-05-07T20:32:54.2816480Z torch.manual_seed(2025) 2025-05-07T20:32:54.2816546Z 2025-05-07T20:32:54.2816755Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2818498Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2818507Z 2025-05-07T20:32:54.2818623Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2818628Z 2025-05-07T20:32:54.2818726Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2818944Z self=, 2025-05-07T20:32:54.2819019Z T=2048, 2025-05-07T20:32:54.2819095Z D=5120, 2025-05-07T20:32:54.2819176Z scale_ub=1200.0, 2025-05-07T20:32:54.2819257Z contiguous=False, 2025-05-07T20:32:54.2819337Z compiled=False, 2025-05-07T20:32:54.2819407Z ) 2025-05-07T20:32:54.2819618Z self = 2025-05-07T20:32:54.2819785Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2819793Z 2025-05-07T20:32:54.2819872Z @given( 2025-05-07T20:32:54.2819985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2820083Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2820191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2820301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2820413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2820488Z ) 2025-05-07T20:32:54.2820726Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2820821Z def test_silu_mul_quant( 2025-05-07T20:32:54.2820892Z self, 2025-05-07T20:32:54.2820965Z T: int, 2025-05-07T20:32:54.2821042Z D: int, 2025-05-07T20:32:54.2821136Z scale_ub: Optional[float], 2025-05-07T20:32:54.2821224Z contiguous: bool, 2025-05-07T20:32:54.2821305Z compiled: bool, 2025-05-07T20:32:54.2821377Z ) -> None: 2025-05-07T20:32:54.2821474Z torch.manual_seed(2025) 2025-05-07T20:32:54.2821545Z 2025-05-07T20:32:54.2821707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2823489Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2823530Z 2025-05-07T20:32:54.2823643Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2823647Z 2025-05-07T20:32:54.2823746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2823960Z self=, 2025-05-07T20:32:54.2824071Z T=4096, 2025-05-07T20:32:54.2824149Z D=7168, 2025-05-07T20:32:54.2824227Z scale_ub=1200.0, 2025-05-07T20:32:54.2824305Z contiguous=True, 2025-05-07T20:32:54.2824385Z compiled=False, 2025-05-07T20:32:54.2824454Z ) 2025-05-07T20:32:54.2824665Z self = 2025-05-07T20:32:54.2824833Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2824838Z 2025-05-07T20:32:54.2824914Z @given( 2025-05-07T20:32:54.2825068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2825165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2825273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2825387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2825495Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2825569Z ) 2025-05-07T20:32:54.2825809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2825896Z def test_silu_mul_quant( 2025-05-07T20:32:54.2825973Z self, 2025-05-07T20:32:54.2826047Z T: int, 2025-05-07T20:32:54.2826118Z D: int, 2025-05-07T20:32:54.2826214Z scale_ub: Optional[float], 2025-05-07T20:32:54.2826299Z contiguous: bool, 2025-05-07T20:32:54.2826382Z compiled: bool, 2025-05-07T20:32:54.2826460Z ) -> None: 2025-05-07T20:32:54.2826550Z torch.manual_seed(2025) 2025-05-07T20:32:54.2826621Z 2025-05-07T20:32:54.2826786Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2828520Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2828529Z 2025-05-07T20:32:54.2828646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2828654Z 2025-05-07T20:32:54.2828752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2828970Z self=, 2025-05-07T20:32:54.2829043Z T=16384, 2025-05-07T20:32:54.2829114Z D=7168, 2025-05-07T20:32:54.2829198Z scale_ub=None, 2025-05-07T20:32:54.2829278Z contiguous=False, 2025-05-07T20:32:54.2829354Z compiled=True, 2025-05-07T20:32:54.2829423Z ) 2025-05-07T20:32:54.2829631Z self = 2025-05-07T20:32:54.2829800Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.2829806Z 2025-05-07T20:32:54.2829885Z @given( 2025-05-07T20:32:54.2829998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2830094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2830203Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2830313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2830472Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2830543Z ) 2025-05-07T20:32:54.2830823Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2830918Z def test_silu_mul_quant( 2025-05-07T20:32:54.2830990Z self, 2025-05-07T20:32:54.2831061Z T: int, 2025-05-07T20:32:54.2831136Z D: int, 2025-05-07T20:32:54.2831232Z scale_ub: Optional[float], 2025-05-07T20:32:54.2831320Z contiguous: bool, 2025-05-07T20:32:54.2831400Z compiled: bool, 2025-05-07T20:32:54.2831515Z ) -> None: 2025-05-07T20:32:54.2831607Z torch.manual_seed(2025) 2025-05-07T20:32:54.2831674Z 2025-05-07T20:32:54.2831835Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2833612Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
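[annotation] The allocator hint repeated in each message can be acted on through the environment; a minimal sketch, assuming it is applied before the first CUDA allocation in the process (in practice it would belong in the CI job's environment rather than mid-test):

import os

# Must take effect before CUDA is initialized in this process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
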
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2833622Z 2025-05-07T20:32:54.2833732Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2833737Z 2025-05-07T20:32:54.2833836Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2834052Z self=, 2025-05-07T20:32:54.2834122Z T=4096, 2025-05-07T20:32:54.2834198Z D=7168, 2025-05-07T20:32:54.2834277Z scale_ub=None, 2025-05-07T20:32:54.2834354Z contiguous=True, 2025-05-07T20:32:54.2834439Z compiled=False, 2025-05-07T20:32:54.2834509Z ) 2025-05-07T20:32:54.2834721Z self = 2025-05-07T20:32:54.2834890Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2834897Z 2025-05-07T20:32:54.2834969Z @given( 2025-05-07T20:32:54.2835089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2835185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2835293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2835406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2835514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2835592Z ) 2025-05-07T20:32:54.2835829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2835918Z def test_silu_mul_quant( 2025-05-07T20:32:54.2835993Z self, 2025-05-07T20:32:54.2836066Z T: int, 2025-05-07T20:32:54.2836138Z D: int, 2025-05-07T20:32:54.2836238Z scale_ub: Optional[float], 2025-05-07T20:32:54.2836324Z contiguous: bool, 2025-05-07T20:32:54.2836405Z compiled: bool, 2025-05-07T20:32:54.2836486Z ) -> None: 2025-05-07T20:32:54.2836579Z torch.manual_seed(2025) 2025-05-07T20:32:54.2836645Z 2025-05-07T20:32:54.2836809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2838538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2838547Z 2025-05-07T20:32:54.2838707Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2838711Z 2025-05-07T20:32:54.2838809Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2839068Z self=, 2025-05-07T20:32:54.2839142Z T=16384, 2025-05-07T20:32:54.2839220Z D=7168, 2025-05-07T20:32:54.2839298Z scale_ub=None, 2025-05-07T20:32:54.2839375Z contiguous=True, 2025-05-07T20:32:54.2839457Z compiled=False, 2025-05-07T20:32:54.2839524Z ) 2025-05-07T20:32:54.2839734Z self = 2025-05-07T20:32:54.2839946Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.2839950Z 2025-05-07T20:32:54.2840021Z @given( 2025-05-07T20:32:54.2840133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2840231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2840339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2840456Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2840565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2840672Z ) 2025-05-07T20:32:54.2840914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2841002Z def test_silu_mul_quant( 2025-05-07T20:32:54.2841074Z self, 2025-05-07T20:32:54.2841146Z T: int, 2025-05-07T20:32:54.2841218Z D: int, 2025-05-07T20:32:54.2841311Z scale_ub: Optional[float], 2025-05-07T20:32:54.2841403Z contiguous: bool, 2025-05-07T20:32:54.2841484Z compiled: bool, 2025-05-07T20:32:54.2841558Z ) -> None: 2025-05-07T20:32:54.2841651Z torch.manual_seed(2025) 2025-05-07T20:32:54.2841718Z 2025-05-07T20:32:54.2841878Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2843621Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
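[annotation] Note the free memory shrinking across examples (26.44 MiB here, down to 4.44 MiB in the later failures) while ~21.7 GiB stays allocated by PyTorch: tensors from earlier Hypothesis examples are evidently still cached. A common mitigation between examples, sketched here as an assumption rather than anything the test currently does:

import gc
import torch

def release_cuda_memory() -> None:
    # Drop dangling references, then return cached allocator blocks.
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
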
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2843630Z 2025-05-07T20:32:54.2843744Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2843752Z 2025-05-07T20:32:54.2843848Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2844062Z self=, 2025-05-07T20:32:54.2844138Z T=16384, 2025-05-07T20:32:54.2844213Z D=7168, 2025-05-07T20:32:54.2844293Z scale_ub=1200.0, 2025-05-07T20:32:54.2844373Z contiguous=True, 2025-05-07T20:32:54.2844456Z compiled=False, 2025-05-07T20:32:54.2844526Z ) 2025-05-07T20:32:54.2844737Z self = 2025-05-07T20:32:54.2844910Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2844914Z 2025-05-07T20:32:54.2844990Z @given( 2025-05-07T20:32:54.2845103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2845196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2845309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2845423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2845533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2845602Z ) 2025-05-07T20:32:54.2845838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2845927Z def test_silu_mul_quant( 2025-05-07T20:32:54.2846004Z self, 2025-05-07T20:32:54.2846076Z T: int, 2025-05-07T20:32:54.2846221Z D: int, 2025-05-07T20:32:54.2846314Z scale_ub: Optional[float], 2025-05-07T20:32:54.2846398Z contiguous: bool, 2025-05-07T20:32:54.2846521Z compiled: bool, 2025-05-07T20:32:54.2846597Z ) -> None: 2025-05-07T20:32:54.2846686Z torch.manual_seed(2025) 2025-05-07T20:32:54.2846760Z 2025-05-07T20:32:54.2846924Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2848658Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2848709Z 2025-05-07T20:32:54.2848821Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2848828Z 2025-05-07T20:32:54.2848965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2849186Z self=, 2025-05-07T20:32:54.2849261Z T=128, 2025-05-07T20:32:54.2849333Z D=5120, 2025-05-07T20:32:54.2849414Z scale_ub=1200.0, 2025-05-07T20:32:54.2849496Z contiguous=False, 2025-05-07T20:32:54.2849574Z compiled=False, 2025-05-07T20:32:54.2849650Z ) 2025-05-07T20:32:54.2849859Z self = 2025-05-07T20:32:54.2850025Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.2850029Z 2025-05-07T20:32:54.2850100Z @given( 2025-05-07T20:32:54.2850212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2850308Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2850422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2850535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2850649Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2850718Z ) 2025-05-07T20:32:54.2850955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2851046Z def test_silu_mul_quant( 2025-05-07T20:32:54.2851117Z self, 2025-05-07T20:32:54.2851192Z T: int, 2025-05-07T20:32:54.2851261Z D: int, 2025-05-07T20:32:54.2851357Z scale_ub: Optional[float], 2025-05-07T20:32:54.2851445Z contiguous: bool, 2025-05-07T20:32:54.2851526Z compiled: bool, 2025-05-07T20:32:54.2851599Z ) -> None: 2025-05-07T20:32:54.2851693Z torch.manual_seed(2025) 2025-05-07T20:32:54.2851762Z 2025-05-07T20:32:54.2851922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2851996Z 2025-05-07T20:32:54.2852085Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2852208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2852299Z x = x_sign * x_clamp 2025-05-07T20:32:54.2852378Z x0 = x[:, :D] 2025-05-07T20:32:54.2852462Z x1 = x[:, D:] 2025-05-07T20:32:54.2852529Z 2025-05-07T20:32:54.2852606Z if contiguous: 2025-05-07T20:32:54.2852695Z x0 = x0.contiguous() 2025-05-07T20:32:54.2852780Z x1 = x1.contiguous() 2025-05-07T20:32:54.2852845Z 2025-05-07T20:32:54.2852936Z if scale_ub is not None: 2025-05-07T20:32:54.2853129Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2853260Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2853334Z ) 2025-05-07T20:32:54.2853406Z else: 2025-05-07T20:32:54.2853496Z scale_ub_tensor = None 2025-05-07T20:32:54.2853564Z 2025-05-07T20:32:54.2853690Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2853826Z op = silu_mul_quant 2025-05-07T20:32:54.2853911Z if compiled: 2025-05-07T20:32:54.2854048Z op = torch.compile(op) 2025-05-07T20:32:54.2854154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2854222Z 2025-05-07T20:32:54.2854308Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2854312Z 2025-05-07T20:32:54.2854408Z moe/activation_test.py:117: 2025-05-07T20:32:54.2854532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2854668Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2854767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2855259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2855358Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2855709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2855930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2856316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2856407Z kernel = self.compile( 2025-05-07T20:32:54.2856780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2856952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2857079Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2857084Z 2025-05-07T20:32:54.2857288Z self = 2025-05-07T20:32:54.2858051Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2858552Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e8247c0>} 2025-05-07T20:32:54.2859839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2860033Z context = 2025-05-07T20:32:54.2860042Z 2025-05-07T20:32:54.2860205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2860458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2860563Z module_map=module_map) 2025-05-07T20:32:54.2860724Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2860818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2860892Z E ^ 2025-05-07T20:32:54.2861243Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2861248Z 2025-05-07T20:32:54.2861652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2861656Z 2025-05-07T20:32:54.2861760Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2861978Z self=, 2025-05-07T20:32:54.2862055Z T=2048, 2025-05-07T20:32:54.2862128Z D=7168, 2025-05-07T20:32:54.2862205Z scale_ub=None, 2025-05-07T20:32:54.2862288Z contiguous=False, 2025-05-07T20:32:54.2862367Z compiled=False, 2025-05-07T20:32:54.2862440Z ) 2025-05-07T20:32:54.2862655Z self = 2025-05-07T20:32:54.2862903Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2862964Z 2025-05-07T20:32:54.2863038Z @given( 2025-05-07T20:32:54.2863157Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2863254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2863369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2863481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2863724Z ) 2025-05-07T20:32:54.2863963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2864053Z def test_silu_mul_quant( 2025-05-07T20:32:54.2864126Z self, 2025-05-07T20:32:54.2864198Z T: int, 2025-05-07T20:32:54.2864271Z D: int, 2025-05-07T20:32:54.2864368Z scale_ub: Optional[float], 2025-05-07T20:32:54.2864456Z contiguous: bool, 2025-05-07T20:32:54.2864537Z compiled: bool, 2025-05-07T20:32:54.2864611Z ) -> None: 2025-05-07T20:32:54.2864762Z torch.manual_seed(2025) 2025-05-07T20:32:54.2864837Z 2025-05-07T20:32:54.2865001Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2866744Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
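[annotation] The fp8e4nv CompilationError above, unlike the OOMs, is deterministic: Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which the NVIDIA backend only lowers on compute capability 8.9+ (Ada/Hopper), while this job's linux.g5.4xlarge runner carries an A10G at 8.6, hence only 'fp8e4b15' and 'fp8e5' are offered. A capability gate of roughly this shape would skip rather than fail (a sketch; the helper and skip condition are assumptions, not FBGEMM's existing logic):

import unittest
import torch

def cuda_supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

skip_unless_fp8 = unittest.skipUnless(
    cuda_supports_fp8e4nv(), "fp8e4nv (float8_e4m3fn) needs SM 8.9+"
)
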
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2866756Z 2025-05-07T20:32:54.2866874Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2866881Z 2025-05-07T20:32:54.2866979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2867202Z self=, 2025-05-07T20:32:54.2867273Z T=128, 2025-05-07T20:32:54.2867343Z D=7168, 2025-05-07T20:32:54.2867422Z scale_ub=1200.0, 2025-05-07T20:32:54.2867503Z contiguous=True, 2025-05-07T20:32:54.2867584Z compiled=True, 2025-05-07T20:32:54.2867653Z ) 2025-05-07T20:32:54.2867862Z self = 2025-05-07T20:32:54.2868028Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2868032Z 2025-05-07T20:32:54.2868105Z @given( 2025-05-07T20:32:54.2868219Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2868318Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2868427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2868540Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2868650Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2868723Z ) 2025-05-07T20:32:54.2868968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2869057Z def test_silu_mul_quant( 2025-05-07T20:32:54.2869127Z self, 2025-05-07T20:32:54.2869201Z T: int, 2025-05-07T20:32:54.2869274Z D: int, 2025-05-07T20:32:54.2869367Z scale_ub: Optional[float], 2025-05-07T20:32:54.2869455Z contiguous: bool, 2025-05-07T20:32:54.2869539Z compiled: bool, 2025-05-07T20:32:54.2869617Z ) -> None: 2025-05-07T20:32:54.2869712Z torch.manual_seed(2025) 2025-05-07T20:32:54.2869781Z 2025-05-07T20:32:54.2869942Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2870013Z 2025-05-07T20:32:54.2870102Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2870224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2870359Z x = x_sign * x_clamp 2025-05-07T20:32:54.2870434Z x0 = x[:, :D] 2025-05-07T20:32:54.2870551Z x1 = x[:, D:] 2025-05-07T20:32:54.2870619Z 2025-05-07T20:32:54.2870700Z if contiguous: 2025-05-07T20:32:54.2870790Z x0 = x0.contiguous() 2025-05-07T20:32:54.2870874Z x1 = x1.contiguous() 2025-05-07T20:32:54.2870944Z 2025-05-07T20:32:54.2871034Z if scale_ub is not None: 2025-05-07T20:32:54.2871136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2871307Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2871379Z ) 2025-05-07T20:32:54.2871451Z else: 2025-05-07T20:32:54.2871542Z scale_ub_tensor = None 2025-05-07T20:32:54.2871612Z 2025-05-07T20:32:54.2871737Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2871827Z op = silu_mul_quant 2025-05-07T20:32:54.2871909Z if compiled: 2025-05-07T20:32:54.2872004Z op = torch.compile(op) 2025-05-07T20:32:54.2872115Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2872250Z 2025-05-07T20:32:54.2872339Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2872343Z 2025-05-07T20:32:54.2872438Z moe/activation_test.py:117: 2025-05-07T20:32:54.2872562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2872656Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2872758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2873123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.2873214Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.2873696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2873789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2874140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2874361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2874691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2874786Z kernel = self.compile( 2025-05-07T20:32:54.2875156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2875329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2875453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2875458Z 2025-05-07T20:32:54.2875658Z self = 2025-05-07T20:32:54.2876422Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2876919Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc79e825940>} 2025-05-07T20:32:54.2877651Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2877839Z context = 2025-05-07T20:32:54.2877843Z 2025-05-07T20:32:54.2878004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2878260Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2878414Z module_map=module_map) 2025-05-07T20:32:54.2878577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2878712Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2878788Z E ^ 2025-05-07T20:32:54.2879137Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2879142Z 2025-05-07T20:32:54.2879544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2879548Z 2025-05-07T20:32:54.2879686Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2879903Z self=, 2025-05-07T20:32:54.2879976Z T=128, 2025-05-07T20:32:54.2880051Z D=7168, 2025-05-07T20:32:54.2880131Z scale_ub=1200.0, 2025-05-07T20:32:54.2880211Z contiguous=True, 2025-05-07T20:32:54.2880293Z compiled=False, 2025-05-07T20:32:54.2880366Z ) 2025-05-07T20:32:54.2880579Z self = 2025-05-07T20:32:54.2880785Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2880791Z 2025-05-07T20:32:54.2880868Z @given( 2025-05-07T20:32:54.2880985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2881079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2881190Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2881304Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2881414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2881484Z ) 2025-05-07T20:32:54.2881725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2881814Z def test_silu_mul_quant( 2025-05-07T20:32:54.2881888Z self, 2025-05-07T20:32:54.2881961Z T: int, 2025-05-07T20:32:54.2882034Z D: int, 2025-05-07T20:32:54.2882132Z scale_ub: Optional[float], 2025-05-07T20:32:54.2882219Z contiguous: bool, 2025-05-07T20:32:54.2882302Z compiled: bool, 2025-05-07T20:32:54.2882384Z ) -> None: 2025-05-07T20:32:54.2882477Z torch.manual_seed(2025) 2025-05-07T20:32:54.2882548Z 2025-05-07T20:32:54.2882715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2882786Z 2025-05-07T20:32:54.2882879Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2883002Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2884746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
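[annotation] The "Trying example:" lines throughout come from Verbosity.verbose, and the deterministic example order comes from the 'ci' profile reported in the rerun banner below (database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)). Reconstructed from that banner, the registration would look roughly like this (the registration site is an assumption):

from hypothesis import HealthCheck, settings

settings.register_profile(
    "ci",
    database=None,
    deadline=None,
    print_blob=True,
    derandomize=True,
    suppress_health_check=(HealthCheck.too_slow,),
)
settings.load_profile("ci")
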
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2884759Z 2025-05-07T20:32:54.2884876Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2884881Z 2025-05-07T20:32:54.2884979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2885197Z self=, 2025-05-07T20:32:54.2885268Z T=128, 2025-05-07T20:32:54.2885340Z D=5120, 2025-05-07T20:32:54.2885421Z scale_ub=1200.0, 2025-05-07T20:32:54.2885504Z contiguous=True, 2025-05-07T20:32:54.2885581Z compiled=True, 2025-05-07T20:32:54.2885656Z ) 2025-05-07T20:32:54.2885866Z self = 2025-05-07T20:32:54.2886026Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.2886031Z 2025-05-07T20:32:54.2886108Z @given( 2025-05-07T20:32:54.2886268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2886368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2886518Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2886630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2886744Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2886814Z ) 2025-05-07T20:32:54.2887054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2887145Z def test_silu_mul_quant( 2025-05-07T20:32:54.2887257Z self, 2025-05-07T20:32:54.2887329Z T: int, 2025-05-07T20:32:54.2887405Z D: int, 2025-05-07T20:32:54.2887499Z scale_ub: Optional[float], 2025-05-07T20:32:54.2887584Z contiguous: bool, 2025-05-07T20:32:54.2887669Z compiled: bool, 2025-05-07T20:32:54.2887739Z ) -> None: 2025-05-07T20:32:54.2887841Z torch.manual_seed(2025) 2025-05-07T20:32:54.2887912Z 2025-05-07T20:32:54.2888073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2888143Z 2025-05-07T20:32:54.2888273Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2888397Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2890134Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2890142Z 2025-05-07T20:32:54.2890254Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.2890260Z 2025-05-07T20:32:54.2890358Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2890574Z self=, 2025-05-07T20:32:54.2890651Z T=128, 2025-05-07T20:32:54.2890731Z D=7168, 2025-05-07T20:32:54.2890806Z scale_ub=None, 2025-05-07T20:32:54.2890889Z contiguous=True, 2025-05-07T20:32:54.2890968Z compiled=True, 2025-05-07T20:32:54.2891036Z ) 2025-05-07T20:32:54.2891247Z self = 2025-05-07T20:32:54.2891406Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2891413Z 2025-05-07T20:32:54.2891486Z @given( 2025-05-07T20:32:54.2891603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2891697Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2891807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2891923Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2892032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2892107Z ) 2025-05-07T20:32:54.2892349Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2892438Z def test_silu_mul_quant( 2025-05-07T20:32:54.2892511Z self, 2025-05-07T20:32:54.2892584Z T: int, 2025-05-07T20:32:54.2892655Z D: int, 2025-05-07T20:32:54.2892752Z scale_ub: Optional[float], 2025-05-07T20:32:54.2892836Z contiguous: bool, 2025-05-07T20:32:54.2892917Z compiled: bool, 2025-05-07T20:32:54.2893070Z ) -> None: 2025-05-07T20:32:54.2893162Z torch.manual_seed(2025) 2025-05-07T20:32:54.2893233Z 2025-05-07T20:32:54.2893400Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2895178Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.2895226Z 2025-05-07T20:32:54.2895338Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.2895467Z =============================== warnings summary =============================== 2025-05-07T20:32:54.2895809Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.2896103Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.2896391Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:54.2897291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:54.2897517Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:54.2897522Z 2025-05-07T20:32:54.2897732Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:54.2897896Z ================= 1 failed, 1 deselected, 3 warnings in 13.10s ================= 2025-05-07T20:32:55.7921630Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:55.8544768Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:55.8545082Z 2025-05-07T20:32:57.8561593Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:00.0089933Z ============================= test session starts ============================== 2025-05-07T20:33:00.0091306Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:00.0092233Z cachedir: .pytest_cache 2025-05-07T20:33:00.0093394Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:00.0100385Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:00.0100839Z plugins: hypothesis-6.131.14 2025-05-07T20:33:01.6152995Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:01.7240757Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:01.7241305Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:01.7241603Z 2025-05-07T20:33:04.0718497Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0720313Z self=, 2025-05-07T20:33:04.0721176Z T=1, 2025-05-07T20:33:04.0721564Z D=5120, 2025-05-07T20:33:04.0721968Z scale_ub=None, 2025-05-07T20:33:04.0722415Z contiguous=True, 2025-05-07T20:33:04.0722865Z compiled=True, 2025-05-07T20:33:04.0723306Z ) 2025-05-07T20:33:04.0723956Z self = 2025-05-07T20:33:04.0724931Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:04.0725451Z 2025-05-07T20:33:04.0725613Z @given( 2025-05-07T20:33:04.0726083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.0726684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.0727425Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.0727856Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.0728194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.0728477Z ) 2025-05-07T20:33:04.0728829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.0729276Z def test_silu_mul_quant( 2025-05-07T20:33:04.0729525Z self, 2025-05-07T20:33:04.0729719Z T: int, 2025-05-07T20:33:04.0729927Z D: int, 2025-05-07T20:33:04.0730250Z scale_ub: Optional[float], 2025-05-07T20:33:04.0730518Z contiguous: bool, 2025-05-07T20:33:04.0730765Z compiled: bool, 2025-05-07T20:33:04.0730996Z ) -> None: 2025-05-07T20:33:04.0731211Z torch.manual_seed(2025) 2025-05-07T20:33:04.0731458Z 2025-05-07T20:33:04.0731739Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.0732079Z 2025-05-07T20:33:04.0732277Z x_sign = torch.sign(x) 2025-05-07T20:33:04.0732572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:04.0733094Z x = x_sign * x_clamp 2025-05-07T20:33:04.0733343Z x0 = x[:, :D] 2025-05-07T20:33:04.0733567Z x1 = x[:, D:] 2025-05-07T20:33:04.0733775Z 2025-05-07T20:33:04.0733966Z if contiguous: 2025-05-07T20:33:04.0734207Z x0 = x0.contiguous() 2025-05-07T20:33:04.0734462Z x1 = x1.contiguous() 2025-05-07T20:33:04.0734712Z 2025-05-07T20:33:04.0734918Z if scale_ub is not None: 2025-05-07T20:33:04.0735195Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.0735535Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.0735849Z ) 2025-05-07T20:33:04.0736051Z else: 2025-05-07T20:33:04.0736261Z scale_ub_tensor = None 2025-05-07T20:33:04.0736523Z 2025-05-07T20:33:04.0736764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0737083Z op = silu_mul_quant 2025-05-07T20:33:04.0737344Z if compiled: 2025-05-07T20:33:04.0737603Z op = torch.compile(op) 2025-05-07T20:33:04.0737900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.0738183Z 2025-05-07T20:33:04.0738389Z y_fp8, y_scale = fn() 2025-05-07T20:33:04.0738672Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:04.0738974Z 2025-05-07T20:33:04.0739221Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.0739555Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:04.0739858Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:04.0740177Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:04.0740538Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.0740848Z 2025-05-07T20:33:04.0741057Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:04.0741255Z 2025-05-07T20:33:04.0741366Z moe/activation_test.py:126: 2025-05-07T20:33:04.0741667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0742007Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:04.0742336Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.0743127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:04.0743874Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:04.0744426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.0745105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.0745784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:04.0746569Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:04.0747348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:04.0747985Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:04.0748578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:04.0749095Z fn() 2025-05-07T20:33:04.0749603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:04.0750232Z self.fn.run( 2025-05-07T20:33:04.0750701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.0751231Z kernel = self.compile( 2025-05-07T20:33:04.0751770Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.0752417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.0752868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.0753095Z 2025-05-07T20:33:04.0753310Z self = 2025-05-07T20:33:04.0754394Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.0755771Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057db01c60>} 2025-05-07T20:33:04.0757098Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.0758121Z context = 2025-05-07T20:33:04.0758412Z 2025-05-07T20:33:04.0758591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.0759101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.0760096Z module_map=module_map) 2025-05-07T20:33:04.0760469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.0760956Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:04.0761250Z E ^ 2025-05-07T20:33:04.0761728Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.0762182Z 2025-05-07T20:33:04.0762607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.0763120Z 2025-05-07T20:33:04.0763237Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.0763659Z self=, 2025-05-07T20:33:04.0764075Z T=2048, 2025-05-07T20:33:04.0764286Z D=5120, 2025-05-07T20:33:04.0764486Z scale_ub=1200.0, 2025-05-07T20:33:04.0764724Z contiguous=True, 2025-05-07T20:33:04.0764964Z compiled=False, 2025-05-07T20:33:04.0765178Z ) 2025-05-07T20:33:04.8084334Z self = 2025-05-07T20:33:04.8085171Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.8085568Z 2025-05-07T20:33:04.8085681Z @given( 2025-05-07T20:33:04.8085995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8086326Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8086635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8086980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8087647Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8087935Z ) 2025-05-07T20:33:04.8088383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8088839Z def test_silu_mul_quant( 2025-05-07T20:33:04.8089089Z self, 2025-05-07T20:33:04.8089288Z T: int, 2025-05-07T20:33:04.8089492Z D: int, 2025-05-07T20:33:04.8089717Z scale_ub: Optional[float], 2025-05-07T20:33:04.8089987Z contiguous: bool, 2025-05-07T20:33:04.8090230Z compiled: bool, 2025-05-07T20:33:04.8090540Z ) -> None: 2025-05-07T20:33:04.8090755Z torch.manual_seed(2025) 2025-05-07T20:33:04.8091006Z 2025-05-07T20:33:04.8091283Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8091623Z 2025-05-07T20:33:04.8091827Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8092126Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8092432Z x = x_sign * x_clamp 2025-05-07T20:33:04.8092684Z x0 = x[:, :D] 
2025-05-07T20:33:04.8092912Z x1 = x[:, D:] 2025-05-07T20:33:04.8093319Z 2025-05-07T20:33:04.8093519Z if contiguous: 2025-05-07T20:33:04.8093760Z x0 = x0.contiguous() 2025-05-07T20:33:04.8094016Z x1 = x1.contiguous() 2025-05-07T20:33:04.8094267Z 2025-05-07T20:33:04.8094474Z if scale_ub is not None: 2025-05-07T20:33:04.8094757Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8095091Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8095415Z ) 2025-05-07T20:33:04.8095616Z else: 2025-05-07T20:33:04.8095828Z scale_ub_tensor = None 2025-05-07T20:33:04.8096089Z 2025-05-07T20:33:04.8096337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8096657Z op = silu_mul_quant 2025-05-07T20:33:04.8096912Z if compiled: 2025-05-07T20:33:04.8097162Z op = torch.compile(op) 2025-05-07T20:33:04.8097461Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8097744Z 2025-05-07T20:33:04.8097939Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.8098108Z 2025-05-07T20:33:04.8098210Z moe/activation_test.py:117: 2025-05-07T20:33:04.8098513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8098846Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.8099130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8099820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.8100508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.8101038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8101716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8102375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8102909Z kernel = self.compile( 2025-05-07T20:33:04.8103444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8104091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8104488Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8104719Z 2025-05-07T20:33:04.8104930Z self = 2025-05-07T20:33:04.8105995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8107450Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d958220>} 2025-05-07T20:33:04.8108816Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8109828Z context = 2025-05-07T20:33:04.8110110Z 2025-05-07T20:33:04.8110275Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8110832Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8111298Z module_map=module_map) 2025-05-07T20:33:04.8111664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8112014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.8112281Z E ^ 2025-05-07T20:33:04.8112760Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8113213Z 2025-05-07T20:33:04.8113668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8114180Z 2025-05-07T20:33:04.8114291Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8114710Z self=, 2025-05-07T20:33:04.8115119Z T=2048, 2025-05-07T20:33:04.8115317Z D=5120, 2025-05-07T20:33:04.8115517Z scale_ub=1200.0, 2025-05-07T20:33:04.8115747Z contiguous=True, 2025-05-07T20:33:04.8115970Z compiled=True, 2025-05-07T20:33:04.8116183Z ) 2025-05-07T20:33:04.8116511Z self = 2025-05-07T20:33:04.8116999Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:04.8117278Z 2025-05-07T20:33:04.8117355Z @given( 2025-05-07T20:33:04.8117589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8117902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8118214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8118547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8118879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8119159Z ) 2025-05-07T20:33:04.8119510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8119952Z def test_silu_mul_quant( 2025-05-07T20:33:04.8120192Z self, 2025-05-07T20:33:04.8120392Z T: int, 2025-05-07T20:33:04.8120593Z D: int, 2025-05-07T20:33:04.8120808Z scale_ub: Optional[float], 2025-05-07T20:33:04.8121079Z contiguous: bool, 2025-05-07T20:33:04.8121321Z compiled: bool, 2025-05-07T20:33:04.8121543Z ) -> None: 2025-05-07T20:33:04.8121762Z torch.manual_seed(2025) 2025-05-07T20:33:04.8122010Z 2025-05-07T20:33:04.8122280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8122628Z 2025-05-07T20:33:04.8122829Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8123116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8123433Z x = x_sign * x_clamp 2025-05-07T20:33:04.8123685Z x0 = x[:, :D] 2025-05-07T20:33:04.8123899Z x1 = x[:, D:] 2025-05-07T20:33:04.8124118Z 2025-05-07T20:33:04.8124308Z if contiguous: 2025-05-07T20:33:04.8124550Z x0 = x0.contiguous() 2025-05-07T20:33:04.8124806Z x1 = x1.contiguous() 2025-05-07T20:33:04.8125052Z 2025-05-07T20:33:04.8125252Z if scale_ub is not None: 2025-05-07T20:33:04.8125524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8125871Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8126248Z ) 2025-05-07T20:33:04.8126438Z else: 2025-05-07T20:33:04.8126653Z scale_ub_tensor = None 2025-05-07T20:33:04.8126905Z 2025-05-07T20:33:04.8127186Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8127513Z op = silu_mul_quant 2025-05-07T20:33:04.8127770Z if compiled: 2025-05-07T20:33:04.8128020Z op = torch.compile(op) 2025-05-07T20:33:04.8128321Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8128600Z 2025-05-07T20:33:04.8128794Z y_fp8, y_scale = fn() 2025-05-07T20:33:04.8129130Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:04.8129427Z 2025-05-07T20:33:04.8129669Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8129999Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:04.8130293Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:04.8130612Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:04.8130966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8131278Z 2025-05-07T20:33:04.8131532Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:04.8131730Z 2025-05-07T20:33:04.8131829Z moe/activation_test.py:126: 2025-05-07T20:33:04.8132125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8132462Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:04.8132791Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8133633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:04.8134385Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:04.8134926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.8135592Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8136280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:04.8137000Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:04.8137722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:04.8138348Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:04.8138945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:04.8139457Z fn() 2025-05-07T20:33:04.8139960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:04.8140528Z self.fn.run( 2025-05-07T20:33:04.8140998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8141528Z kernel = self.compile( 2025-05-07T20:33:04.8142063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8142705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8143097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8143322Z 2025-05-07T20:33:04.8143530Z self = 2025-05-07T20:33:04.8144585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8145942Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d9596c0>} 2025-05-07T20:33:04.8147403Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8148412Z context = 2025-05-07T20:33:04.8148700Z 2025-05-07T20:33:04.8148874Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8149384Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8149885Z module_map=module_map) 2025-05-07T20:33:04.8150254Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8150610Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:04.8150882Z E ^ 2025-05-07T20:33:04.8151348Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8151795Z 2025-05-07T20:33:04.8152213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8152758Z 2025-05-07T20:33:04.8152869Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8153283Z self=, 2025-05-07T20:33:04.8153686Z T=16384, 2025-05-07T20:33:04.8153878Z D=7168, 2025-05-07T20:33:04.8154083Z scale_ub=1200.0, 2025-05-07T20:33:04.8154318Z contiguous=False, 2025-05-07T20:33:04.8154546Z compiled=False, 2025-05-07T20:33:04.8154762Z ) 2025-05-07T20:33:05.5408614Z self = 2025-05-07T20:33:05.5409413Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.5409802Z 2025-05-07T20:33:05.5409925Z @given( 2025-05-07T20:33:05.5410240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.5410690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.5411119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.5411486Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.5411813Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.5412108Z ) 2025-05-07T20:33:05.5412467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.5412908Z def test_silu_mul_quant( 2025-05-07T20:33:05.5413248Z self, 2025-05-07T20:33:05.5413452Z T: int, 2025-05-07T20:33:05.5413655Z D: int, 2025-05-07T20:33:05.5413882Z scale_ub: Optional[float], 2025-05-07T20:33:05.5414163Z contiguous: bool, 2025-05-07T20:33:05.5414408Z compiled: bool, 2025-05-07T20:33:05.5414651Z ) -> None: 2025-05-07T20:33:05.5414877Z torch.manual_seed(2025) 2025-05-07T20:33:05.5415128Z 2025-05-07T20:33:05.5415411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.5415766Z 2025-05-07T20:33:05.5415970Z x_sign = torch.sign(x) 2025-05-07T20:33:05.5416267Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.5416589Z x = x_sign * x_clamp 2025-05-07T20:33:05.5416842Z x0 = x[:, :D] 2025-05-07T20:33:05.5417059Z x1 = x[:, D:] 2025-05-07T20:33:05.5417273Z 2025-05-07T20:33:05.5417469Z if contiguous: 2025-05-07T20:33:05.5417704Z x0 = x0.contiguous() 2025-05-07T20:33:05.5417967Z x1 = x1.contiguous() 2025-05-07T20:33:05.5418213Z 2025-05-07T20:33:05.5418403Z if scale_ub is not None: 2025-05-07T20:33:05.5418680Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.5419019Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.5419321Z ) 2025-05-07T20:33:05.5419521Z else: 2025-05-07T20:33:05.5419735Z scale_ub_tensor = None 2025-05-07T20:33:05.5420155Z 2025-05-07T20:33:05.5420389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.5420704Z op = silu_mul_quant 2025-05-07T20:33:05.5421041Z if compiled: 2025-05-07T20:33:05.5421292Z op = torch.compile(op) 2025-05-07T20:33:05.5421589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5421866Z 2025-05-07T20:33:05.5422054Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.5422222Z 2025-05-07T20:33:05.5422323Z moe/activation_test.py:117: 2025-05-07T20:33:05.5422619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5423026Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.5423311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.5424005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.5424690Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.5425222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.5425975Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.5426635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.5427198Z kernel = self.compile( 2025-05-07T20:33:05.5427758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.5428412Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.5428811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.5429037Z 2025-05-07T20:33:05.5429242Z self = 2025-05-07T20:33:05.5430315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.5431691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c824720>} 2025-05-07T20:33:05.5433019Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.5434037Z context = 2025-05-07T20:33:05.5434320Z 2025-05-07T20:33:05.5434488Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.5435008Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.5435478Z module_map=module_map) 2025-05-07T20:33:05.5435844Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.5436199Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.5436471Z E ^ 2025-05-07T20:33:05.5436930Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    [... test source identical to the example above; this time fn() returns and
     the failure moves to the reference path ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f057c8242c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
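The reference path fails the same way: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, and compilation aborts before any config can be timed. For intuition, this is roughly what ref_fn computes, written as a plain-PyTorch sketch with no Triton; FP8_MAX, the epsilon guard, and the rounding behaviour are assumptions and may differ from FBGEMM's actual row-wise quantization:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # SiLU(x0) * x1 in fp32, then symmetric per-row quantization to FP8.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)  # avoid divide-by-zero
        if scale_ub is not None:
            # scale_ub is a 1-element fp32 tensor, as in the test above.
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does (y_fp8.to(torch.float32) * scale[:, None]) recovers the fp32 activation up to FP8 rounding error.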
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.5482642Z 2025-05-07T20:33:05.5483065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.5483572Z 2025-05-07T20:33:05.5483688Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.5484102Z self=, 2025-05-07T20:33:05.5484512Z T=4096, 2025-05-07T20:33:05.5484711Z D=5120, 2025-05-07T20:33:05.5484906Z scale_ub=None, 2025-05-07T20:33:05.5485133Z contiguous=False, 2025-05-07T20:33:05.5485368Z compiled=False, 2025-05-07T20:33:05.5485576Z ) 2025-05-07T20:33:06.3419518Z self = 2025-05-07T20:33:06.3421011Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:06.3421925Z 2025-05-07T20:33:06.3422088Z @given( 2025-05-07T20:33:06.3422657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.3423290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.3423897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.3424542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.3425188Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.3425755Z ) 2025-05-07T20:33:06.3426434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.3427390Z def test_silu_mul_quant( 2025-05-07T20:33:06.3427637Z self, 2025-05-07T20:33:06.3427838Z T: int, 2025-05-07T20:33:06.3428041Z D: int, 2025-05-07T20:33:06.3428266Z scale_ub: Optional[float], 2025-05-07T20:33:06.3428545Z contiguous: bool, 2025-05-07T20:33:06.3428787Z compiled: bool, 2025-05-07T20:33:06.3429020Z ) -> None: 2025-05-07T20:33:06.3429239Z torch.manual_seed(2025) 2025-05-07T20:33:06.3429481Z 2025-05-07T20:33:06.3429827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.3430175Z 2025-05-07T20:33:06.3430370Z x_sign = torch.sign(x) 2025-05-07T20:33:06.3430664Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.3430979Z x = x_sign * x_clamp 2025-05-07T20:33:06.3431219Z x0 = x[:, :D] 2025-05-07T20:33:06.3431445Z x1 = x[:, D:] 2025-05-07T20:33:06.3431663Z 2025-05-07T20:33:06.3431852Z if contiguous: 2025-05-07T20:33:06.3432095Z x0 = x0.contiguous() 2025-05-07T20:33:06.3432358Z x1 = x1.contiguous() 2025-05-07T20:33:06.3432596Z 2025-05-07T20:33:06.3432794Z if scale_ub is not None: 2025-05-07T20:33:06.3433078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.3433416Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.3433723Z ) 2025-05-07T20:33:06.3433923Z else: 2025-05-07T20:33:06.3434140Z scale_ub_tensor = None 2025-05-07T20:33:06.3434392Z 2025-05-07T20:33:06.3434632Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.3434952Z op = silu_mul_quant 2025-05-07T20:33:06.3435205Z if compiled: 2025-05-07T20:33:06.3435457Z op = torch.compile(op) 2025-05-07T20:33:06.3435755Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3436027Z 2025-05-07T20:33:06.3436226Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.3436392Z 2025-05-07T20:33:06.3436503Z moe/activation_test.py:117: 2025-05-07T20:33:06.3436799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3437137Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.3437446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3438169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.3438851Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.3439389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.3440065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.3440729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.3441254Z kernel = self.compile( 2025-05-07T20:33:06.3441800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.3442450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.3442842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3443071Z 2025-05-07T20:33:06.3443331Z self = 2025-05-07T20:33:06.3444439Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.3445795Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d937240>} 2025-05-07T20:33:06.3447119Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.3448217Z context = 2025-05-07T20:33:06.3448505Z 2025-05-07T20:33:06.3448670Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.3449185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.3449690Z module_map=module_map) 2025-05-07T20:33:06.3450053Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.3450403Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.3450662Z E ^ 2025-05-07T20:33:06.3451118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.3451565Z 2025-05-07T20:33:06.3451977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.3452490Z 2025-05-07T20:33:06.3452594Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.3453095Z self=, 2025-05-07T20:33:06.3453489Z T=4096, 2025-05-07T20:33:06.3453682Z D=7168, 2025-05-07T20:33:06.3453883Z scale_ub=None, 2025-05-07T20:33:06.3454095Z contiguous=False, 2025-05-07T20:33:06.3454327Z compiled=False, 2025-05-07T20:33:06.3454535Z ) 2025-05-07T20:33:06.3454854Z self = 2025-05-07T20:33:06.3455353Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:06.3455622Z 2025-05-07T20:33:06.3455705Z @given( 2025-05-07T20:33:06.3455932Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.3456246Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.3456556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.3456876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.3457205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.3457523Z ) 2025-05-07T20:33:06.3457896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.3458331Z def test_silu_mul_quant( 2025-05-07T20:33:06.3458576Z self, 2025-05-07T20:33:06.3458775Z T: int, 2025-05-07T20:33:06.3458969Z D: int, 2025-05-07T20:33:06.3459365Z scale_ub: Optional[float], 2025-05-07T20:33:06.3459646Z contiguous: bool, 2025-05-07T20:33:06.3459882Z compiled: bool, 2025-05-07T20:33:06.3460110Z ) -> None: 2025-05-07T20:33:06.3460326Z torch.manual_seed(2025) 2025-05-07T20:33:06.3460564Z 2025-05-07T20:33:06.3460835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.3461177Z 2025-05-07T20:33:06.3461368Z x_sign = torch.sign(x) 2025-05-07T20:33:06.3461658Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.3461965Z x = x_sign * x_clamp 2025-05-07T20:33:06.3462198Z x0 = x[:, :D] 2025-05-07T20:33:06.3462416Z x1 = x[:, D:] 2025-05-07T20:33:06.3462624Z 2025-05-07T20:33:06.3462808Z if contiguous: 2025-05-07T20:33:06.3463041Z x0 = x0.contiguous() 2025-05-07T20:33:06.3463378Z x1 = x1.contiguous() 2025-05-07T20:33:06.3463618Z 2025-05-07T20:33:06.3463868Z if scale_ub is not None: 2025-05-07T20:33:06.3464148Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.3464483Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.3464790Z ) 2025-05-07T20:33:06.3464985Z else: 2025-05-07T20:33:06.3465196Z scale_ub_tensor = None 2025-05-07T20:33:06.3465443Z 2025-05-07T20:33:06.3465679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.3466052Z op = silu_mul_quant 2025-05-07T20:33:06.3466297Z if compiled: 2025-05-07T20:33:06.3466548Z op = torch.compile(op) 2025-05-07T20:33:06.3466844Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3467111Z 2025-05-07T20:33:06.3467305Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.3467469Z 2025-05-07T20:33:06.3467579Z moe/activation_test.py:117: 2025-05-07T20:33:06.3467878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3468262Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.3468545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.3469226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.3469901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.3470435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.3471111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.3471771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.3472292Z kernel = self.compile( 2025-05-07T20:33:06.3472830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.3473483Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.3473873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.3474106Z 2025-05-07T20:33:06.3474312Z self = 2025-05-07T20:33:06.3475379Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.3476728Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c1918a0>} 2025-05-07T20:33:06.3478048Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.3479057Z context = 2025-05-07T20:33:06.3479345Z 2025-05-07T20:33:06.3479510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.3480027Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.3480492Z module_map=module_map) 2025-05-07T20:33:06.3480853Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.3481208Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.3481464Z E ^ 2025-05-07T20:33:06.3481915Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.3482362Z 2025-05-07T20:33:06.3482772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.3483407Z 2025-05-07T20:33:06.3483511Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.3483963Z self=, 2025-05-07T20:33:06.3484357Z T=128, 2025-05-07T20:33:06.3484548Z D=7168, 2025-05-07T20:33:06.3484745Z scale_ub=None, 2025-05-07T20:33:06.3484954Z contiguous=False, 2025-05-07T20:33:06.3485183Z compiled=True, 2025-05-07T20:33:06.3485394Z ) 2025-05-07T20:33:06.4040283Z self = 2025-05-07T20:33:06.4041676Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:06.4042217Z 2025-05-07T20:33:06.4042383Z @given( 2025-05-07T20:33:06.4042855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.4043482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.4044089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.4044752Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.4045412Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.4045988Z ) 2025-05-07T20:33:06.4046819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.4047467Z def test_silu_mul_quant( 2025-05-07T20:33:06.4047705Z self, 2025-05-07T20:33:06.4047903Z T: int, 2025-05-07T20:33:06.4048099Z D: int, 2025-05-07T20:33:06.4048314Z scale_ub: Optional[float], 2025-05-07T20:33:06.4048584Z contiguous: bool, 2025-05-07T20:33:06.4048828Z compiled: bool, 2025-05-07T20:33:06.4049050Z ) -> None: 2025-05-07T20:33:06.4049267Z torch.manual_seed(2025) 2025-05-07T20:33:06.4049511Z 2025-05-07T20:33:06.4049785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.4050124Z 2025-05-07T20:33:06.4050325Z x_sign = torch.sign(x) 2025-05-07T20:33:06.4050619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.4050929Z x = x_sign * x_clamp 2025-05-07T20:33:06.4051174Z x0 = x[:, :D] 2025-05-07T20:33:06.4051402Z x1 = x[:, D:] 2025-05-07T20:33:06.4051612Z 2025-05-07T20:33:06.4051805Z if contiguous: 2025-05-07T20:33:06.4052036Z x0 = x0.contiguous() 2025-05-07T20:33:06.4052287Z x1 = x1.contiguous() 2025-05-07T20:33:06.4052528Z 2025-05-07T20:33:06.4052717Z if scale_ub is not None: 2025-05-07T20:33:06.4053075Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.4053413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.4053724Z ) 2025-05-07T20:33:06.4053909Z else: 2025-05-07T20:33:06.4054121Z scale_ub_tensor = None 2025-05-07T20:33:06.4054372Z 2025-05-07T20:33:06.4054602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.4054910Z op = silu_mul_quant 2025-05-07T20:33:06.4055163Z if compiled: 2025-05-07T20:33:06.4055412Z op = torch.compile(op) 2025-05-07T20:33:06.4055703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.4055977Z 2025-05-07T20:33:06.4056163Z y_fp8, y_scale = fn() 2025-05-07T20:33:06.4056446Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:06.4056735Z 2025-05-07T20:33:06.4056973Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.4057297Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:06.4057588Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:06.4057898Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:06.4058246Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.4058556Z 2025-05-07T20:33:06.4058758Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:06.4058948Z 2025-05-07T20:33:06.4059052Z moe/activation_test.py:126: 2025-05-07T20:33:06.4059569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.4059900Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.4060320Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.4061094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.4061835Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.4062375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.4063105Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.4063771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:06.4064478Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.4065200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:06.4065887Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:06.4066477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:06.4066984Z fn() 2025-05-07T20:33:06.4067514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:06.4068104Z self.fn.run( 2025-05-07T20:33:06.4068570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.4069091Z kernel = self.compile( 2025-05-07T20:33:06.4069627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.4070265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.4070658Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.4070882Z 2025-05-07T20:33:06.4071099Z self = 2025-05-07T20:33:06.4072160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.4073497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c0f5a80>} 2025-05-07T20:33:06.4074810Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.4075811Z context = 2025-05-07T20:33:06.4076095Z 2025-05-07T20:33:06.4076266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.4076771Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.4077231Z module_map=module_map) 2025-05-07T20:33:06.4077595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.4077996Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.4078255Z E ^ 2025-05-07T20:33:06.4078712Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.4079154Z 2025-05-07T20:33:06.4079565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.4080064Z 2025-05-07T20:33:06.4080172Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.4080650Z self=, 2025-05-07T20:33:06.4081043Z T=128, 2025-05-07T20:33:06.4081229Z D=7168, 2025-05-07T20:33:06.4081465Z scale_ub=None, 2025-05-07T20:33:06.4081679Z contiguous=False, 2025-05-07T20:33:06.4081904Z compiled=False, 2025-05-07T20:33:06.4082103Z ) 2025-05-07T20:33:06.6037300Z self = 2025-05-07T20:33:06.6037991Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:06.6038407Z 2025-05-07T20:33:06.6038657Z @given( 2025-05-07T20:33:06.6038974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.6044790Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.6045134Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.6045469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.6045793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.6046084Z ) 2025-05-07T20:33:06.6046433Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.6046988Z def test_silu_mul_quant( 2025-05-07T20:33:06.6047250Z self, 2025-05-07T20:33:06.6047467Z T: int, 2025-05-07T20:33:06.6047690Z D: int, 2025-05-07T20:33:06.6047911Z scale_ub: Optional[float], 2025-05-07T20:33:06.6048177Z contiguous: bool, 2025-05-07T20:33:06.6048424Z compiled: bool, 2025-05-07T20:33:06.6048651Z ) -> None: 2025-05-07T20:33:06.6048861Z torch.manual_seed(2025) 2025-05-07T20:33:06.6049119Z 2025-05-07T20:33:06.6049398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.6049737Z 2025-05-07T20:33:06.6049937Z x_sign = torch.sign(x) 2025-05-07T20:33:06.6050233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.6050533Z x = x_sign * x_clamp 2025-05-07T20:33:06.6050777Z x0 = x[:, :D] 2025-05-07T20:33:06.6050993Z x1 = x[:, D:] 2025-05-07T20:33:06.6051197Z 2025-05-07T20:33:06.6051393Z if contiguous: 2025-05-07T20:33:06.6051633Z x0 = x0.contiguous() 2025-05-07T20:33:06.6051886Z x1 = x1.contiguous() 2025-05-07T20:33:06.6052125Z 2025-05-07T20:33:06.6052320Z if scale_ub is not None: 2025-05-07T20:33:06.6052594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.6052920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.6053322Z ) 2025-05-07T20:33:06.6053520Z else: 2025-05-07T20:33:06.6053726Z scale_ub_tensor = None 2025-05-07T20:33:06.6053984Z 2025-05-07T20:33:06.6054213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.6054520Z op = silu_mul_quant 2025-05-07T20:33:06.6054771Z if compiled: 2025-05-07T20:33:06.6055021Z op = torch.compile(op) 2025-05-07T20:33:06.6055307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6055581Z 2025-05-07T20:33:06.6055775Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.6055940Z 2025-05-07T20:33:06.6056053Z moe/activation_test.py:117: 2025-05-07T20:33:06.6056345Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6056673Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.6056952Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6057634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.6058318Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.6058850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.6059712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.6060361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.6060967Z kernel = self.compile( 2025-05-07T20:33:06.6061569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.6062211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.6062604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6062835Z 2025-05-07T20:33:06.6063045Z self = 2025-05-07T20:33:06.6064168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.6065516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553f30860>} 2025-05-07T20:33:06.6066896Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.6067906Z context = 2025-05-07T20:33:06.6068192Z 2025-05-07T20:33:06.6068366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.6068880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.6069348Z module_map=module_map) 2025-05-07T20:33:06.6069717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.6070074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.6070331Z E ^ 2025-05-07T20:33:06.6070794Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.6071240Z 2025-05-07T20:33:06.6071663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.6072162Z 2025-05-07T20:33:06.6072272Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.6072676Z self=, 2025-05-07T20:33:06.6073074Z T=4096, 2025-05-07T20:33:06.6073264Z D=5120, 2025-05-07T20:33:06.6073453Z scale_ub=1200.0, 2025-05-07T20:33:06.6073677Z contiguous=True, 2025-05-07T20:33:06.6073901Z compiled=False, 2025-05-07T20:33:06.6074101Z ) 2025-05-07T20:33:06.6074419Z self = 2025-05-07T20:33:06.6074907Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:06.6075171Z 2025-05-07T20:33:06.6075254Z @given( 2025-05-07T20:33:06.6075479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.6075792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.6076097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.6076426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.6076755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.6077039Z ) 2025-05-07T20:33:06.6077383Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.6077871Z def test_silu_mul_quant( 2025-05-07T20:33:06.6078110Z self, 2025-05-07T20:33:06.6078306Z T: int, 2025-05-07T20:33:06.6078505Z D: int, 2025-05-07T20:33:06.6078728Z scale_ub: Optional[float], 2025-05-07T20:33:06.6078991Z contiguous: bool, 2025-05-07T20:33:06.6079237Z compiled: bool, 2025-05-07T20:33:06.6079463Z ) -> None: 2025-05-07T20:33:06.6079677Z torch.manual_seed(2025) 2025-05-07T20:33:06.6079926Z 2025-05-07T20:33:06.6080263Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.6080602Z 2025-05-07T20:33:06.6080797Z x_sign = torch.sign(x) 2025-05-07T20:33:06.6081138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.6081457Z x = x_sign * x_clamp 2025-05-07T20:33:06.6081695Z x0 = x[:, :D] 2025-05-07T20:33:06.6081917Z x1 = x[:, D:] 2025-05-07T20:33:06.6082125Z 2025-05-07T20:33:06.6082311Z if contiguous: 2025-05-07T20:33:06.6082548Z x0 = x0.contiguous() 2025-05-07T20:33:06.6082816Z x1 = x1.contiguous() 2025-05-07T20:33:06.6083104Z 2025-05-07T20:33:06.6083309Z if scale_ub is not None: 2025-05-07T20:33:06.6083590Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.6083922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.6084231Z ) 2025-05-07T20:33:06.6084425Z else: 2025-05-07T20:33:06.6084631Z scale_ub_tensor = None 2025-05-07T20:33:06.6084890Z 2025-05-07T20:33:06.6085123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.6085435Z op = silu_mul_quant 2025-05-07T20:33:06.6085719Z if compiled: 2025-05-07T20:33:06.6085966Z op = torch.compile(op) 2025-05-07T20:33:06.6086267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6086532Z 2025-05-07T20:33:06.6086730Z > y_fp8, y_scale = fn() 2025-05-07T20:33:06.6086894Z 2025-05-07T20:33:06.6086999Z moe/activation_test.py:117: 2025-05-07T20:33:06.6087289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6087617Z moe/activation_test.py:115: in fn 2025-05-07T20:33:06.6087896Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.6088569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:06.6089249Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:06.6089781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.6090456Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.6091101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.6091623Z kernel = self.compile( 2025-05-07T20:33:06.6092158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.6092806Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.6093243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.6093473Z 2025-05-07T20:33:06.6093678Z self = 2025-05-07T20:33:06.6094742Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.6096084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553f314e0>} 2025-05-07T20:33:06.6097393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.6098399Z context = 2025-05-07T20:33:06.6098685Z 2025-05-07T20:33:06.6098849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.6099365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.6099904Z module_map=module_map) 2025-05-07T20:33:06.6100267Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.6100663Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.6100921Z E ^ 2025-05-07T20:33:06.6101386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.6101842Z 2025-05-07T20:33:06.6102262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.6102811Z 2025-05-07T20:33:06.6102921Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.6103339Z self=, 2025-05-07T20:33:06.6103740Z T=1, 2025-05-07T20:33:06.6103925Z D=5120, 2025-05-07T20:33:06.6104121Z scale_ub=None, 2025-05-07T20:33:06.6104334Z contiguous=True, 2025-05-07T20:33:06.6104560Z compiled=True, 2025-05-07T20:33:06.6104769Z ) 2025-05-07T20:33:06.9902452Z self = 2025-05-07T20:33:06.9903327Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:06.9903686Z 2025-05-07T20:33:06.9903797Z @given( 2025-05-07T20:33:06.9904116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:06.9904430Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:06.9904739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:06.9905071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:06.9905399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:06.9905685Z ) 2025-05-07T20:33:06.9906037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:06.9906481Z def test_silu_mul_quant( 2025-05-07T20:33:06.9906721Z self, 2025-05-07T20:33:06.9906917Z T: int, 2025-05-07T20:33:06.9907120Z D: int, 2025-05-07T20:33:06.9907340Z scale_ub: Optional[float], 2025-05-07T20:33:06.9907635Z contiguous: bool, 2025-05-07T20:33:06.9907909Z compiled: bool, 2025-05-07T20:33:06.9908135Z ) -> None: 2025-05-07T20:33:06.9908355Z torch.manual_seed(2025) 2025-05-07T20:33:06.9908600Z 2025-05-07T20:33:06.9908868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:06.9909203Z 2025-05-07T20:33:06.9909392Z x_sign = torch.sign(x) 2025-05-07T20:33:06.9909675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:06.9909988Z x = x_sign * x_clamp 2025-05-07T20:33:06.9910229Z x0 = x[:, :D] 2025-05-07T20:33:06.9910439Z x1 = x[:, D:] 2025-05-07T20:33:06.9910642Z 2025-05-07T20:33:06.9910827Z if contiguous: 2025-05-07T20:33:06.9911058Z x0 = x0.contiguous() 2025-05-07T20:33:06.9911308Z x1 = x1.contiguous() 2025-05-07T20:33:06.9911558Z 2025-05-07T20:33:06.9911750Z if scale_ub is not None: 2025-05-07T20:33:06.9912027Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:06.9912364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:06.9912674Z ) 2025-05-07T20:33:06.9912868Z else: 2025-05-07T20:33:06.9913082Z scale_ub_tensor = None 2025-05-07T20:33:06.9913333Z 2025-05-07T20:33:06.9913563Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9913869Z op = silu_mul_quant 2025-05-07T20:33:06.9914117Z if compiled: 2025-05-07T20:33:06.9914362Z op = torch.compile(op) 2025-05-07T20:33:06.9914653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:06.9914927Z 2025-05-07T20:33:06.9915115Z y_fp8, y_scale = fn() 2025-05-07T20:33:06.9915399Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:06.9915683Z 2025-05-07T20:33:06.9915914Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:06.9916329Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:06.9916617Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:06.9916984Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:06.9917344Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.9917647Z 2025-05-07T20:33:06.9917861Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:06.9918082Z 2025-05-07T20:33:06.9918185Z moe/activation_test.py:126: 2025-05-07T20:33:06.9918480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9918873Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.9919188Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.9919964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.9920703Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.9921245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:06.9921952Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:06.9922629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:06.9923335Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:06.9924052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:06.9924678Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:06.9925270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:06.9925781Z fn() 2025-05-07T20:33:06.9926276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:06.9926848Z self.fn.run( 2025-05-07T20:33:06.9927318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:06.9927853Z kernel = self.compile( 2025-05-07T20:33:06.9928426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:06.9929066Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:06.9929462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.9929688Z 2025-05-07T20:33:06.9929891Z self = 2025-05-07T20:33:06.9930958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:06.9932315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553f32d40>} 2025-05-07T20:33:06.9933706Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:06.9934724Z context = 2025-05-07T20:33:06.9935008Z 2025-05-07T20:33:06.9935176Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:06.9935688Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:06.9936157Z module_map=module_map) 2025-05-07T20:33:06.9936523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.9936926Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.9937196Z E ^ 2025-05-07T20:33:06.9937706Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.9938199Z 2025-05-07T20:33:06.9938617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.9939121Z 2025-05-07T20:33:06.9939225Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.9939632Z self=, 2025-05-07T20:33:06.9940063Z T=2048, 2025-05-07T20:33:06.9940244Z D=5120, 2025-05-07T20:33:06.9940439Z scale_ub=None, 2025-05-07T20:33:06.9940653Z contiguous=True, 2025-05-07T20:33:06.9940869Z compiled=True, 2025-05-07T20:33:06.9941071Z ) 2025-05-07T20:33:07.3588709Z self = 2025-05-07T20:33:07.3589470Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.3589844Z 2025-05-07T20:33:07.3589962Z @given( 2025-05-07T20:33:07.3590420Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.3590832Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.3591240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.3591618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.3591941Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.3592228Z ) 2025-05-07T20:33:07.3592568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.3593009Z def test_silu_mul_quant( 2025-05-07T20:33:07.3593249Z self, 2025-05-07T20:33:07.3593447Z T: int, 2025-05-07T20:33:07.3593644Z D: int, 2025-05-07T20:33:07.3593861Z scale_ub: Optional[float], 2025-05-07T20:33:07.3594126Z contiguous: bool, 2025-05-07T20:33:07.3594360Z compiled: bool, 2025-05-07T20:33:07.3594585Z ) -> None: 2025-05-07T20:33:07.3594797Z torch.manual_seed(2025) 2025-05-07T20:33:07.3595030Z 2025-05-07T20:33:07.3595300Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.3595643Z 2025-05-07T20:33:07.3595831Z x_sign = torch.sign(x) 2025-05-07T20:33:07.3596119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.3596420Z x = x_sign * x_clamp 2025-05-07T20:33:07.3596652Z x0 = x[:, :D] 2025-05-07T20:33:07.3596872Z x1 = x[:, D:] 2025-05-07T20:33:07.3597086Z 2025-05-07T20:33:07.3597268Z if contiguous: 2025-05-07T20:33:07.3597499Z x0 = x0.contiguous() 2025-05-07T20:33:07.3597756Z x1 = x1.contiguous() 2025-05-07T20:33:07.3597991Z 2025-05-07T20:33:07.3598187Z if scale_ub is not None: 2025-05-07T20:33:07.3598463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.3598795Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.3599097Z ) 2025-05-07T20:33:07.3599295Z else: 2025-05-07T20:33:07.3599512Z scale_ub_tensor = None 2025-05-07T20:33:07.3599762Z 2025-05-07T20:33:07.3599995Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.3600308Z op = silu_mul_quant 2025-05-07T20:33:07.3600553Z if compiled: 2025-05-07T20:33:07.3600806Z op = torch.compile(op) 2025-05-07T20:33:07.3601105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.3601375Z 2025-05-07T20:33:07.3601566Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.3601850Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.3602135Z 2025-05-07T20:33:07.3602376Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.3602712Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.3602999Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.3603380Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.3603732Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.3604107Z 2025-05-07T20:33:07.3604307Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.3604501Z 2025-05-07T20:33:07.3604602Z moe/activation_test.py:126: 2025-05-07T20:33:07.3604899Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.3605228Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.3605549Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.3606421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.3607161Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.3607695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.3608421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.3609142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.3609855Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.3610565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.3611192Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.3611789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.3612291Z fn() 2025-05-07T20:33:07.3612792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.3613454Z self.fn.run( 2025-05-07T20:33:07.3613919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.3614444Z kernel = self.compile( 2025-05-07T20:33:07.3614986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.3615629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.3616017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.3616248Z 2025-05-07T20:33:07.3616454Z self = 2025-05-07T20:33:07.3617519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.3618918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c1ede40>} 2025-05-07T20:33:07.3620242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.3621247Z context = 2025-05-07T20:33:07.3621537Z 2025-05-07T20:33:07.3621701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.3622212Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.3622673Z module_map=module_map) 2025-05-07T20:33:07.3623028Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.3623381Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.3623644Z E ^ 2025-05-07T20:33:07.3624100Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:07.3624604Z 2025-05-07T20:33:07.3625051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:07.3625554Z 2025-05-07T20:33:07.3625655Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:07.3626061Z self=, 2025-05-07T20:33:07.3626449Z T=128, 2025-05-07T20:33:07.3626636Z D=5120, 2025-05-07T20:33:07.3626825Z scale_ub=None, 2025-05-07T20:33:07.3627075Z contiguous=True, 2025-05-07T20:33:07.3627296Z compiled=True, 2025-05-07T20:33:07.3627498Z ) 2025-05-07T20:33:07.7888902Z self = 2025-05-07T20:33:07.7889673Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:07.7890031Z 2025-05-07T20:33:07.7890146Z @given( 2025-05-07T20:33:07.7890414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:07.7890730Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:07.7891163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:07.7891501Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:07.7891826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:07.7892114Z ) 2025-05-07T20:33:07.7892460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:07.7892897Z def test_silu_mul_quant( 2025-05-07T20:33:07.7893218Z self, 2025-05-07T20:33:07.7893420Z T: int, 2025-05-07T20:33:07.7893611Z D: int, 2025-05-07T20:33:07.7893840Z scale_ub: Optional[float], 2025-05-07T20:33:07.7899167Z contiguous: bool, 2025-05-07T20:33:07.7899459Z compiled: bool, 2025-05-07T20:33:07.7899691Z ) -> None: 2025-05-07T20:33:07.7899915Z torch.manual_seed(2025) 2025-05-07T20:33:07.7900159Z 2025-05-07T20:33:07.7900454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:07.7900806Z 2025-05-07T20:33:07.7901008Z x_sign = torch.sign(x) 2025-05-07T20:33:07.7901308Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:07.7901622Z x = x_sign * x_clamp 2025-05-07T20:33:07.7901867Z x0 = x[:, :D] 2025-05-07T20:33:07.7902094Z x1 = x[:, D:] 2025-05-07T20:33:07.7902310Z 2025-05-07T20:33:07.7902496Z if contiguous: 2025-05-07T20:33:07.7902737Z x0 = x0.contiguous() 2025-05-07T20:33:07.7903007Z x1 = x1.contiguous() 2025-05-07T20:33:07.7903257Z 2025-05-07T20:33:07.7903461Z if scale_ub is not None: 2025-05-07T20:33:07.7903735Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:07.7904079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:07.7904387Z ) 2025-05-07T20:33:07.7904590Z else: 2025-05-07T20:33:07.7904801Z scale_ub_tensor = None 2025-05-07T20:33:07.7905054Z 2025-05-07T20:33:07.7905298Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.7905628Z op = silu_mul_quant 2025-05-07T20:33:07.7905876Z if compiled: 2025-05-07T20:33:07.7906129Z op = torch.compile(op) 2025-05-07T20:33:07.7906431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:07.7906708Z 2025-05-07T20:33:07.7906908Z y_fp8, y_scale = fn() 2025-05-07T20:33:07.7907197Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:07.7907490Z 2025-05-07T20:33:07.7907732Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:07.7908085Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:07.7908404Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:07.7908714Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:07.7909072Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.7909503Z 2025-05-07T20:33:07.7909707Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:07.7909907Z 2025-05-07T20:33:07.7910070Z moe/activation_test.py:126: 2025-05-07T20:33:07.7910366Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7910693Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:07.7911016Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:07.7911794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:07.7912599Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:07.7913132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:07.7913803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:07.7914479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:07.7915236Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:07.7915948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:07.7916580Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:07.7917176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:07.7917683Z fn() 2025-05-07T20:33:07.7918187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:07.7918810Z self.fn.run( 2025-05-07T20:33:07.7919272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:07.7919798Z kernel = self.compile( 2025-05-07T20:33:07.7920331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:07.7920972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:07.7921359Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:07.7921589Z 2025-05-07T20:33:07.7921793Z self = 2025-05-07T20:33:07.7922863Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:07.7924213Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552f0dd00>} 2025-05-07T20:33:07.7925524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:07.7926531Z context = 2025-05-07T20:33:07.7926822Z 2025-05-07T20:33:07.7926984Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:07.7927499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:07.7927961Z module_map=module_map) 2025-05-07T20:33:07.7928349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:07.7928725Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:07.7928994Z E ^ 2025-05-07T20:33:07.7929447Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.7929893Z 
2025-05-07T20:33:07.7930300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.7930856Z 
2025-05-07T20:33:07.7930993Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:07.7931404Z     self=,
2025-05-07T20:33:07.7931800Z     T=4096,
2025-05-07T20:33:07.7931989Z     D=5120,
2025-05-07T20:33:07.7932178Z     scale_ub=None,
2025-05-07T20:33:07.7932386Z     contiguous=True,
2025-05-07T20:33:07.7932609Z     compiled=True,
2025-05-07T20:33:07.7932817Z )
2025-05-07T20:33:08.2234585Z self = 
2025-05-07T20:33:08.2235456Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:08.2235818Z 
2025-05-07T20:33:08.2235930Z @given(
2025-05-07T20:33:08.2236217Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:08.2236532Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:08.2236841Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:08.2237173Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:08.2237585Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:33:08.2237874Z )
2025-05-07T20:33:08.2238218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:08.2238653Z def test_silu_mul_quant(
2025-05-07T20:33:08.2238894Z     self,
2025-05-07T20:33:08.2239084Z     T: int,
2025-05-07T20:33:08.2239280Z     D: int,
2025-05-07T20:33:08.2239500Z     scale_ub: Optional[float],
2025-05-07T20:33:08.2239767Z     contiguous: bool,
2025-05-07T20:33:08.2240007Z     compiled: bool,
2025-05-07T20:33:08.2240232Z ) -> None:
2025-05-07T20:33:08.2240441Z     torch.manual_seed(2025)
2025-05-07T20:33:08.2240682Z 
2025-05-07T20:33:08.2240963Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:08.2241297Z 
2025-05-07T20:33:08.2241498Z     x_sign = torch.sign(x)
2025-05-07T20:33:08.2241788Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:08.2242097Z     x = x_sign * x_clamp
2025-05-07T20:33:08.2242345Z     x0 = x[:, :D]
2025-05-07T20:33:08.2242559Z     x1 = x[:, D:]
2025-05-07T20:33:08.2242762Z 
2025-05-07T20:33:08.2242945Z     if contiguous:
2025-05-07T20:33:08.2243180Z         x0 = x0.contiguous()
2025-05-07T20:33:08.2243448Z         x1 = x1.contiguous()
2025-05-07T20:33:08.2243680Z 
2025-05-07T20:33:08.2243874Z     if scale_ub is not None:
2025-05-07T20:33:08.2244150Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:33:08.2244476Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:08.2244785Z         )
2025-05-07T20:33:08.2244981Z     else:
2025-05-07T20:33:08.2245192Z         scale_ub_tensor = None
2025-05-07T20:33:08.2245445Z 
2025-05-07T20:33:08.2245683Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.2245995Z         op = silu_mul_quant
2025-05-07T20:33:08.2246248Z         if compiled:
2025-05-07T20:33:08.2246494Z             op = torch.compile(op)
2025-05-07T20:33:08.2246795Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:08.2247070Z 
2025-05-07T20:33:08.2247267Z     y_fp8, y_scale = fn()
2025-05-07T20:33:08.2247552Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:08.2247836Z 
2025-05-07T20:33:08.2248082Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:08.2248420Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:08.2248705Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:08.2249020Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:08.2249376Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.2249680Z 
2025-05-07T20:33:08.2249879Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:08.2250069Z 
2025-05-07T20:33:08.2250250Z moe/activation_test.py:126: 
2025-05-07T20:33:08.2250547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:08.2250967Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:08.2251296Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:08.2252076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:08.2252818Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:08.2253463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:08.2254189Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:08.2254871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:08.2255579Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:08.2256305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:08.2256981Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:08.2257584Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:08.2258093Z     fn()
2025-05-07T20:33:08.2258599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:08.2259376Z     self.fn.run(
2025-05-07T20:33:08.2259916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:08.2260452Z     kernel = self.compile(
2025-05-07T20:33:08.2260988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:08.2261640Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:08.2262031Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:08.2262265Z 
2025-05-07T20:33:08.2262469Z self = 
2025-05-07T20:33:08.2263539Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:08.2264884Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055335b4c0>}
2025-05-07T20:33:08.2266199Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:08.2267205Z context = 
2025-05-07T20:33:08.2267495Z 
2025-05-07T20:33:08.2267664Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:08.2268176Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:08.2268634Z                            module_map=module_map)
2025-05-07T20:33:08.2269002Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.2269354Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.2269623Z E       ^
2025-05-07T20:33:08.2270075Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
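Every failing example in this run collapses to the same root cause: the runner's GPU cannot compile FP8 E4M3 (Triton's fp8e4nv) kernels. Triton only enables fp8e4nv on compute capability 8.9 and newer (Ada/Hopper-class parts); the A10G in a g5.4xlarge is sm_86, where only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. A capability guard in the test module would skip these cases up front instead of failing every hypothesis example. A minimal sketch of such a guard follows; the helper name supports_fp8e4nv and the class name are illustrative, not FBGEMM APIs:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) needs an Ada (sm_89) or newer GPU;
        # the A10G on this runner is sm_86 and only exposes fp8e4b15 / fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical placement: guard the whole test class.
    @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...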
2025-05-07T20:33:08.2270522Z 
2025-05-07T20:33:08.2270932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.2271432Z 
2025-05-07T20:33:08.2271630Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.2272041Z     self=,
2025-05-07T20:33:08.2272489Z     T=16384,
2025-05-07T20:33:08.2272688Z     D=5120,
2025-05-07T20:33:08.2272879Z     scale_ub=None,
2025-05-07T20:33:08.2273085Z     contiguous=True,
2025-05-07T20:33:08.2273308Z     compiled=True,
2025-05-07T20:33:08.2273517Z )
2025-05-07T20:33:08.2531555Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:08.2533088Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:08.2534526Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:08.2535510Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:08.2536649Z W0507 20:33:08.251000 99481 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:08.3409208Z self = 
2025-05-07T20:33:08.3409971Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
[... test source and traceback identical to the T=4096 example above: moe/activation_test.py:126 in ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] ...]
2025-05-07T20:33:08.3443397Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.3443746Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.3444008Z E       ^
2025-05-07T20:33:08.3444462Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.3445013Z 
2025-05-07T20:33:08.3445429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
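The recompile_limit warning above is a separate issue from the FP8 failure, but it explains why later compiled=True examples silently fall back to eager: dynamo guards on input strides, and x0 = x[:, :D] sliced from a [T, 2*D] buffer keeps row stride 2*D (10240), while the .contiguous() copy has row stride D (5120), so alternating contiguous examples force a fresh compile each time until the limit of 8 is exhausted. A small sketch of the layout difference, plus one possible mitigation (raising the limit; the config knob is the one named in the warning):

    import torch

    T, D = 4096, 5120
    x = torch.randn([T, 2 * D], dtype=torch.bfloat16)

    x0_view = x[:, :D]              # a view: row stride is 2*D = 10240
    x0_copy = x0_view.contiguous()  # a fresh buffer: row stride is D = 5120
    print(x0_view.stride(), x0_copy.stride())  # (10240, 1) (5120, 1)

    # torch.compile specializes on strides, so the two layouts each compile
    # their own graph; across many hypothesis examples this exhausts the
    # default limit of 8. One mitigation (a sketch, not what the test does):
    torch._dynamo.config.recompile_limit = 64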
2025-05-07T20:33:08.3445935Z 
2025-05-07T20:33:08.3446039Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.3446450Z     self=,
2025-05-07T20:33:08.3446846Z     T=1,
2025-05-07T20:33:08.3447020Z     D=5120,
2025-05-07T20:33:08.3447214Z     scale_ub=1200.0,
2025-05-07T20:33:08.3447435Z     contiguous=True,
2025-05-07T20:33:08.3447699Z     compiled=True,
2025-05-07T20:33:08.3447899Z )
2025-05-07T20:33:08.4853429Z self = 
2025-05-07T20:33:08.4854169Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[... test source identical to the T=4096 example above; this time the compiled forward path fails first: moe/activation_test.py:117 in fn -> torch/_dynamo/eval_frame.py:678 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
2025-05-07T20:33:08.4880949Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.4881298Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.4881557Z E       ^
2025-05-07T20:33:08.4882012Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.4882449Z 
2025-05-07T20:33:08.4882861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.4883367Z 
2025-05-07T20:33:08.4883471Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.4883875Z     self=,
2025-05-07T20:33:08.4884268Z     T=1,
2025-05-07T20:33:08.4884446Z     D=5120,
2025-05-07T20:33:08.4884644Z     scale_ub=None,
2025-05-07T20:33:08.4884856Z     contiguous=False,
2025-05-07T20:33:08.4885075Z     compiled=True,
2025-05-07T20:33:08.4885278Z )
2025-05-07T20:33:08.7021289Z self = 
2025-05-07T20:33:08.7029652Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical; the reference path fails again: moe/activation_test.py:126 in ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] ...]
2025-05-07T20:33:08.7063378Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.7063748Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:08.7064032Z E       ^
2025-05-07T20:33:08.7064498Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
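Both kernels die in compilation, but the quantity under test is simple: silu(x0) * x1 followed by row-wise FP8 quantization. A minimal pure-PyTorch sketch of that reference computation, assuming the common row-wise scheme (per-row scale = row max / FP8 max, optionally clamped by scale_ub); fbgemm's triton_quantize_fp8_row may differ in details such as eps handling or clamping order:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # silu(x0) * x1
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = row_max.clamp(min=1e-12) / FP8_MAX   # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0, x1 = torch.randn(2, 4, 5120).unbind(0)
    y_fp8, y_scale = silu_mul_quant_ref(x0, x1)
    y = y_fp8.to(torch.float32) * y_scale[:, None]  # dequantize, as the test does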
2025-05-07T20:33:08.7064946Z 
2025-05-07T20:33:08.7065368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.7065875Z 
2025-05-07T20:33:08.7065986Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.7066399Z     self=,
2025-05-07T20:33:08.7066796Z     T=1,
2025-05-07T20:33:08.7066982Z     D=5120,
2025-05-07T20:33:08.7067178Z     scale_ub=None,
2025-05-07T20:33:08.7067399Z     contiguous=True,
2025-05-07T20:33:08.7067625Z     compiled=False,
2025-05-07T20:33:08.7067830Z )
2025-05-07T20:33:08.8550319Z self = 
2025-05-07T20:33:08.8551082Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[... test source identical; the eager forward path fails: moe/activation_test.py:117 in fn -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
2025-05-07T20:33:08.8577502Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.8577850Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.8578106Z E       ^
2025-05-07T20:33:08.8578631Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:08.8579122Z 
2025-05-07T20:33:08.8579530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:08.8580037Z 
2025-05-07T20:33:08.8580139Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:08.8580548Z     self=,
2025-05-07T20:33:08.8580936Z     T=128,
2025-05-07T20:33:08.8581124Z     D=5120,
2025-05-07T20:33:08.8581311Z     scale_ub=None,
2025-05-07T20:33:08.8581520Z     contiguous=False,
2025-05-07T20:33:08.8581738Z     compiled=True,
2025-05-07T20:33:08.8581939Z )
2025-05-07T20:33:08.8582250Z self = 
2025-05-07T20:33:08.8582729Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
[... test source identical; compiled forward path: moe/activation_test.py:117 in fn -> torch/_dynamo/eval_frame.py:678 -> activation.py:80 in silu_mul_quant -> _fbgemm_silu_mul_quant[grid] ...]
2025-05-07T20:33:08.8608828Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:08.8609197Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:08.8609446Z E       ^
2025-05-07T20:33:08.8609902Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
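Hypothesis keeps drawing fresh parameter combinations, but every one is doomed on this GPU, so the log repeats the identical failure. When triaging a case like this, it can help to pin one concrete failing example so it replays deterministically before any randomly drawn ones. A self-contained sketch (a toy test, not the FBGEMM suite):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        contiguous=st.sampled_from([True, False]),
    )
    @example(T=128, contiguous=False)  # the case under triage replays first
    @settings(max_examples=10, deadline=None)
    def test_replays_pinned_case(T: int, contiguous: bool) -> None:
        assert T >= 1  # stand-in for the real assertions

    test_replays_pinned_case()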
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8610344Z 2025-05-07T20:33:08.8610758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8611305Z 2025-05-07T20:33:08.8611414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8611859Z self=, 2025-05-07T20:33:08.8612250Z T=128, 2025-05-07T20:33:08.8612434Z D=7168, 2025-05-07T20:33:08.8612624Z scale_ub=1200.0, 2025-05-07T20:33:08.8612844Z contiguous=False, 2025-05-07T20:33:08.8613139Z compiled=False, 2025-05-07T20:33:08.8613335Z ) 2025-05-07T20:33:08.9729513Z self = 2025-05-07T20:33:08.9730368Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.9730744Z 2025-05-07T20:33:08.9730856Z @given( 2025-05-07T20:33:08.9731174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.9731600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.9732018Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.9732450Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.9732786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.9733142Z ) 2025-05-07T20:33:08.9733582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.9734027Z def test_silu_mul_quant( 2025-05-07T20:33:08.9734279Z self, 2025-05-07T20:33:08.9734482Z T: int, 2025-05-07T20:33:08.9734689Z D: int, 2025-05-07T20:33:08.9734916Z scale_ub: Optional[float], 2025-05-07T20:33:08.9735194Z contiguous: bool, 2025-05-07T20:33:08.9735441Z compiled: bool, 2025-05-07T20:33:08.9735684Z ) -> None: 2025-05-07T20:33:08.9735907Z torch.manual_seed(2025) 2025-05-07T20:33:08.9736156Z 2025-05-07T20:33:08.9736432Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.9736793Z 2025-05-07T20:33:08.9737004Z x_sign = torch.sign(x) 2025-05-07T20:33:08.9737302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.9737629Z x = x_sign * x_clamp 2025-05-07T20:33:08.9737898Z x0 = x[:, :D] 2025-05-07T20:33:08.9738137Z x1 = x[:, D:] 2025-05-07T20:33:08.9738355Z 2025-05-07T20:33:08.9738555Z if contiguous: 2025-05-07T20:33:08.9738792Z x0 = x0.contiguous() 2025-05-07T20:33:08.9739061Z x1 = x1.contiguous() 2025-05-07T20:33:08.9739323Z 2025-05-07T20:33:08.9739523Z if scale_ub is not None: 2025-05-07T20:33:08.9739808Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.9740158Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.9740472Z ) 2025-05-07T20:33:08.9740680Z else: 2025-05-07T20:33:08.9740902Z scale_ub_tensor = None 2025-05-07T20:33:08.9741167Z 2025-05-07T20:33:08.9741406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.9741727Z op = silu_mul_quant 2025-05-07T20:33:08.9741991Z if compiled: 2025-05-07T20:33:08.9742246Z op = torch.compile(op) 2025-05-07T20:33:08.9742550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9742841Z 2025-05-07T20:33:08.9743040Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.9743212Z 2025-05-07T20:33:08.9743317Z moe/activation_test.py:117: 2025-05-07T20:33:08.9743618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9743950Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.9744238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9744925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.9745608Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.9746141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.9746820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.9747600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.9748124Z kernel = self.compile( 2025-05-07T20:33:08.9748712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.9749357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.9749749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9750042Z 2025-05-07T20:33:08.9750247Z self = 2025-05-07T20:33:08.9751304Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.9752654Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553fb3e20>} 2025-05-07T20:33:08.9754008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.9755013Z context = 2025-05-07T20:33:08.9755295Z 2025-05-07T20:33:08.9755465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.9755978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.9756440Z module_map=module_map) 2025-05-07T20:33:08.9756798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.9757148Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.9757411Z E ^ 2025-05-07T20:33:08.9757872Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.9758315Z 2025-05-07T20:33:08.9758728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.9759469Z 2025-05-07T20:33:08.9759576Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.9759986Z self=, 2025-05-07T20:33:08.9760379Z T=128, 2025-05-07T20:33:08.9760569Z D=5120, 2025-05-07T20:33:08.9760762Z scale_ub=None, 2025-05-07T20:33:08.9760974Z contiguous=False, 2025-05-07T20:33:08.9761197Z compiled=False, 2025-05-07T20:33:08.9761403Z ) 2025-05-07T20:33:08.9761717Z self = 2025-05-07T20:33:08.9762227Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:08.9762493Z 2025-05-07T20:33:08.9762570Z @given( 2025-05-07T20:33:08.9762803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.9763115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.9763416Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.9763743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.9764066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.9764343Z ) 2025-05-07T20:33:08.9764685Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.9765127Z def test_silu_mul_quant( 2025-05-07T20:33:08.9765370Z self, 2025-05-07T20:33:08.9765560Z T: int, 2025-05-07T20:33:08.9765763Z D: int, 2025-05-07T20:33:08.9765984Z scale_ub: Optional[float], 2025-05-07T20:33:08.9766249Z contiguous: bool, 2025-05-07T20:33:08.9766488Z compiled: bool, 2025-05-07T20:33:08.9766709Z ) -> None: 2025-05-07T20:33:08.9767002Z torch.manual_seed(2025) 2025-05-07T20:33:08.9767242Z 2025-05-07T20:33:08.9767571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.9767912Z 2025-05-07T20:33:08.9768117Z x_sign = torch.sign(x) 2025-05-07T20:33:08.9768413Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.9768720Z x = x_sign * x_clamp 2025-05-07T20:33:08.9768963Z x0 = x[:, :D] 2025-05-07T20:33:08.9769179Z x1 = x[:, D:] 2025-05-07T20:33:08.9769390Z 2025-05-07T20:33:08.9769650Z if contiguous: 2025-05-07T20:33:08.9769882Z x0 = x0.contiguous() 2025-05-07T20:33:08.9770140Z x1 = x1.contiguous() 2025-05-07T20:33:08.9770382Z 2025-05-07T20:33:08.9770575Z if scale_ub is not None: 2025-05-07T20:33:08.9770856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.9771188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.9777886Z ) 2025-05-07T20:33:08.9778106Z else: 2025-05-07T20:33:08.9778322Z scale_ub_tensor = None 2025-05-07T20:33:08.9778605Z 2025-05-07T20:33:08.9778971Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.9779293Z op = silu_mul_quant 2025-05-07T20:33:08.9779538Z if compiled: 2025-05-07T20:33:08.9779788Z op = torch.compile(op) 2025-05-07T20:33:08.9780087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9780356Z 2025-05-07T20:33:08.9780542Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.9780711Z 2025-05-07T20:33:08.9780813Z moe/activation_test.py:117: 2025-05-07T20:33:08.9781108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9781430Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.9781711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.9782387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.9783066Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.9783591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.9784254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.9784898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.9785417Z kernel = self.compile( 2025-05-07T20:33:08.9785949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.9786587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.9786977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.9787200Z 2025-05-07T20:33:08.9787406Z self = 2025-05-07T20:33:08.9788477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.9789870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385c400>} 2025-05-07T20:33:08.9791189Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.9792190Z context = 2025-05-07T20:33:08.9792474Z 2025-05-07T20:33:08.9792636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.9793195Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.9793690Z module_map=module_map) 2025-05-07T20:33:08.9794045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.9794392Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.9794645Z E ^ 2025-05-07T20:33:08.9795094Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.9795540Z 2025-05-07T20:33:08.9795948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.9796492Z 2025-05-07T20:33:08.9796595Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.9797006Z self=, 2025-05-07T20:33:08.9797395Z T=128, 2025-05-07T20:33:08.9797581Z D=5120, 2025-05-07T20:33:08.9797776Z scale_ub=1200.0, 2025-05-07T20:33:08.9797994Z contiguous=True, 2025-05-07T20:33:08.9798215Z compiled=False, 2025-05-07T20:33:08.9798417Z ) 2025-05-07T20:33:09.1497477Z self = 2025-05-07T20:33:09.1498878Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:09.1499292Z 2025-05-07T20:33:09.1499399Z @given( 2025-05-07T20:33:09.1499713Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1500115Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1500435Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1500768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1501099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1501381Z ) 2025-05-07T20:33:09.1501732Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1502169Z def test_silu_mul_quant( 2025-05-07T20:33:09.1502412Z self, 2025-05-07T20:33:09.1502610Z T: int, 2025-05-07T20:33:09.1502809Z D: int, 2025-05-07T20:33:09.1503033Z scale_ub: Optional[float], 2025-05-07T20:33:09.1503304Z contiguous: bool, 2025-05-07T20:33:09.1503550Z compiled: bool, 2025-05-07T20:33:09.1503773Z ) -> None: 2025-05-07T20:33:09.1503996Z torch.manual_seed(2025) 2025-05-07T20:33:09.1504246Z 2025-05-07T20:33:09.1504522Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1504873Z 2025-05-07T20:33:09.1505076Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1505367Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1505678Z x = x_sign * x_clamp 2025-05-07T20:33:09.1505931Z x0 = x[:, :D] 2025-05-07T20:33:09.1506156Z x1 = x[:, D:] 2025-05-07T20:33:09.1506366Z 2025-05-07T20:33:09.1506557Z if contiguous: 2025-05-07T20:33:09.1506793Z x0 = x0.contiguous() 2025-05-07T20:33:09.1507058Z x1 = x1.contiguous() 2025-05-07T20:33:09.1507309Z 2025-05-07T20:33:09.1507510Z if scale_ub is not None: 2025-05-07T20:33:09.1507786Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1508125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1508433Z ) 2025-05-07T20:33:09.1508623Z else: 2025-05-07T20:33:09.1508846Z scale_ub_tensor = None 2025-05-07T20:33:09.1509104Z 2025-05-07T20:33:09.1509344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1509664Z op = silu_mul_quant 2025-05-07T20:33:09.1509922Z if compiled: 2025-05-07T20:33:09.1510171Z op = torch.compile(op) 2025-05-07T20:33:09.1510471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1510751Z 2025-05-07T20:33:09.1510950Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1511115Z 2025-05-07T20:33:09.1511218Z moe/activation_test.py:117: 2025-05-07T20:33:09.1511609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1511999Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1512280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1512964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1513644Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1514171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1514920Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1515575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1516097Z kernel = self.compile( 2025-05-07T20:33:09.1516633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1517291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1517728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1517955Z 2025-05-07T20:33:09.1518167Z self = 2025-05-07T20:33:09.1519278Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1520636Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385d300>} 2025-05-07T20:33:09.1521951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1522969Z context = 2025-05-07T20:33:09.1523251Z 2025-05-07T20:33:09.1523422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1523930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1524393Z module_map=module_map) 2025-05-07T20:33:09.1524762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1525123Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1525390Z E ^ 2025-05-07T20:33:09.1525855Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1526298Z 2025-05-07T20:33:09.1526715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1527222Z 2025-05-07T20:33:09.1527327Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.1527749Z self=, 2025-05-07T20:33:09.1528154Z T=1, 2025-05-07T20:33:09.1528342Z D=7168, 2025-05-07T20:33:09.1528542Z scale_ub=1200.0, 2025-05-07T20:33:09.1528769Z contiguous=True, 2025-05-07T20:33:09.1529005Z compiled=True, 2025-05-07T20:33:09.1529247Z ) 2025-05-07T20:33:09.1529566Z self = 2025-05-07T20:33:09.1530048Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:09.1530312Z 2025-05-07T20:33:09.1530390Z @given( 2025-05-07T20:33:09.1530622Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1530937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1531241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1531620Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1531954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1532299Z ) 2025-05-07T20:33:09.1532650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1533177Z def test_silu_mul_quant( 2025-05-07T20:33:09.1533413Z self, 2025-05-07T20:33:09.1533615Z T: int, 2025-05-07T20:33:09.1533822Z D: int, 2025-05-07T20:33:09.1534040Z scale_ub: Optional[float], 2025-05-07T20:33:09.1534319Z contiguous: bool, 2025-05-07T20:33:09.1534608Z compiled: bool, 2025-05-07T20:33:09.1534828Z ) -> None: 2025-05-07T20:33:09.1535044Z torch.manual_seed(2025) 2025-05-07T20:33:09.1535282Z 2025-05-07T20:33:09.1535554Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1535914Z 2025-05-07T20:33:09.1536113Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1536398Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1536704Z x = x_sign * x_clamp 2025-05-07T20:33:09.1536944Z x0 = x[:, :D] 2025-05-07T20:33:09.1537216Z x1 = x[:, D:] 2025-05-07T20:33:09.1537425Z 2025-05-07T20:33:09.1537614Z if contiguous: 2025-05-07T20:33:09.1537850Z x0 = x0.contiguous() 2025-05-07T20:33:09.1538106Z x1 = x1.contiguous() 2025-05-07T20:33:09.1538351Z 2025-05-07T20:33:09.1538545Z if scale_ub is not None: 2025-05-07T20:33:09.1538818Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1539151Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1539456Z ) 2025-05-07T20:33:09.1539650Z else: 2025-05-07T20:33:09.1539856Z scale_ub_tensor = None 2025-05-07T20:33:09.1540110Z 2025-05-07T20:33:09.1540339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1540650Z op = silu_mul_quant 2025-05-07T20:33:09.1540907Z if compiled: 2025-05-07T20:33:09.1541153Z op = torch.compile(op) 2025-05-07T20:33:09.1541446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1541728Z 2025-05-07T20:33:09.1541924Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1542086Z 2025-05-07T20:33:09.1542187Z moe/activation_test.py:117: 2025-05-07T20:33:09.1542492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1542833Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1543110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1543662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.1544218Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.1544873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1545541Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1546082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1546758Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1547413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1547938Z kernel = self.compile( 2025-05-07T20:33:09.1548484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1549129Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1549520Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1549747Z 2025-05-07T20:33:09.1549951Z self = 2025-05-07T20:33:09.1551061Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1552442Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385eac0>} 2025-05-07T20:33:09.1553759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1554805Z context = 2025-05-07T20:33:09.1555088Z 2025-05-07T20:33:09.1555253Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1555766Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1556232Z module_map=module_map) 2025-05-07T20:33:09.1556590Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1556989Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1557257Z E ^ 2025-05-07T20:33:09.1557718Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:09.1559356Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[The log repeats the full test body and an identical Triton traceback for this example; duplicate elided. It fails at `y_fp8, y_scale = fn()` with the same CompilationError from _fbgemm_silu_mul_quant.]
2025-05-07T20:33:09.2890984Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.2891393Z     self=,
2025-05-07T20:33:09.2891793Z     T=1,
2025-05-07T20:33:09.2891979Z     D=7168,
2025-05-07T20:33:09.2892173Z     scale_ub=None,
2025-05-07T20:33:09.2892382Z     contiguous=False,
2025-05-07T20:33:09.2892610Z     compiled=True,
2025-05-07T20:33:09.2892810Z )
2025-05-07T20:33:09.3749999Z self = 
2025-05-07T20:33:09.3750869Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:09.3751227Z 
2025-05-07T20:33:09.3751339Z     @given(
2025-05-07T20:33:09.3751615Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:09.3751921Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:09.3752224Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:09.3752557Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:09.3752884Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:09.3753238Z     )
2025-05-07T20:33:09.3753584Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:09.3754020Z     def test_silu_mul_quant(
2025-05-07T20:33:09.3754255Z         self,
2025-05-07T20:33:09.3754445Z         T: int,
2025-05-07T20:33:09.3754641Z         D: int,
2025-05-07T20:33:09.3754853Z         scale_ub: Optional[float],
2025-05-07T20:33:09.3755124Z         contiguous: bool,
2025-05-07T20:33:09.3755359Z         compiled: bool,
2025-05-07T20:33:09.3755578Z     ) -> None:
2025-05-07T20:33:09.3755796Z         torch.manual_seed(2025)
2025-05-07T20:33:09.3756033Z 
2025-05-07T20:33:09.3756296Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:09.3756632Z 
2025-05-07T20:33:09.3756826Z         x_sign = torch.sign(x)
2025-05-07T20:33:09.3757113Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.3757421Z         x = x_sign * x_clamp
2025-05-07T20:33:09.3757668Z         x0 = x[:, :D]
2025-05-07T20:33:09.3757882Z         x1 = x[:, D:]
2025-05-07T20:33:09.3758086Z 
2025-05-07T20:33:09.3758265Z         if contiguous:
2025-05-07T20:33:09.3758506Z             x0 = x0.contiguous()
2025-05-07T20:33:09.3758799Z             x1 = x1.contiguous()
2025-05-07T20:33:09.3759035Z 
2025-05-07T20:33:09.3759416Z         if scale_ub is not None:
2025-05-07T20:33:09.3759685Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:09.3760015Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:09.3760321Z             )
2025-05-07T20:33:09.3760508Z         else:
2025-05-07T20:33:09.3760720Z             scale_ub_tensor = None
2025-05-07T20:33:09.3760972Z 
2025-05-07T20:33:09.3761198Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:09.3761507Z             op = silu_mul_quant
2025-05-07T20:33:09.3761754Z             if compiled:
2025-05-07T20:33:09.3761995Z                 op = torch.compile(op)
2025-05-07T20:33:09.3762293Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:09.3762565Z 
2025-05-07T20:33:09.3762751Z         y_fp8, y_scale = fn()
2025-05-07T20:33:09.3763035Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:09.3763322Z 
2025-05-07T20:33:09.3763556Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:09.3763883Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:09.3764171Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:09.3764478Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:09.3764826Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:09.3765131Z 
2025-05-07T20:33:09.3765331Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:09.3765521Z 
2025-05-07T20:33:09.3765720Z moe/activation_test.py:126: 
2025-05-07T20:33:09.3766014Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3766459Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:09.3766784Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:09.3767555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:09.3768292Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:09.3768838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:33:09.3769603Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:09.3770280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:09.3770986Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:09.3771707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:09.3772384Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:09.3773043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:09.3773551Z     fn()
2025-05-07T20:33:09.3774060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:09.3774632Z     self.fn.run(
2025-05-07T20:33:09.3775101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:09.3775619Z     kernel = self.compile(
2025-05-07T20:33:09.3781963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:09.3782608Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:09.3783006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3783234Z 
2025-05-07T20:33:09.3783441Z self = 
2025-05-07T20:33:09.3784512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:09.3785865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552298b80>}
2025-05-07T20:33:09.3787189Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:09.3788195Z context = 
2025-05-07T20:33:09.3788478Z 
2025-05-07T20:33:09.3788676Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:09.3789213Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:09.3789679Z                            module_map=module_map)
2025-05-07T20:33:09.3790031Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:09.3790383Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:09.3790647Z E       ^
2025-05-07T20:33:09.3791101Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.3791546Z 
2025-05-07T20:33:09.3791957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
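[Note: unlike the other examples, this one gets past the fused kernel and fails in the test's own reference path: `triton_quantize_fp8_row` launches `_kernel_quantize_fp8_row` through the autotuner (`do_bench` above), which hits the same fp8e4nv compilation error. For orientation, a hedged eager-PyTorch sketch of what a rowwise fp8 quantization of this shape computes; the exact numerics (eps handling, scale_ub semantics) of FBGEMM's kernel are assumptions, and torch.float8_e5m2 is chosen only because it is an fp8 dtype this GPU does support.]

```python
import torch


def rowwise_quantize_fp8_sketch(y, scale_ub=None, fp8_dtype=torch.float8_e5m2):
    # Per-row dequantization scale, chosen so that
    #   y ~= y_fp8.float() * y_scale[:, None],
    # matching how the test dequantizes above.
    fp8_max = torch.finfo(fp8_dtype).max
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        # Cap the per-row maximum, as the scale_ub argument suggests (assumption).
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale[:, None]).to(fp8_dtype)
    return y_fp8, y_scale
```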
[Hypothesis then tried the following examples; for each one the log repeats the full test body and an identical Triton traceback, elided here. Every example fails at `y_fp8, y_scale = fn()` with the same CompilationError from _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
2025-05-07T20:33:09.3792569Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.5352438Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:09.5383662Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.6279174Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:09.7453525Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:09.7491167Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:09.7521749Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:09.9271413Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
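[Note: since every example fails identically, a single case is enough to reproduce outside the suite. A hypothetical standalone repro follows, with the import path and call signature taken from the traceback and test body above and the smallest shape Hypothesis tried; on this runner (sm_86) it should raise the same CompilationError, while an sm_89+ GPU (e.g. L4 or H100) should succeed.]

```python
import torch

# Import path as shown in the traceback above.
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 1, 5120  # smallest failing example
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

# Third argument is the optional scale upper bound; None matches the
# scale_ub=None branch of the test.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
print(y_fp8.dtype, y_scale.shape)
```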
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.9302493Z 2025-05-07T20:33:09.9302906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.9303415Z 2025-05-07T20:33:10.2360090Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2360587Z self=, 2025-05-07T20:33:10.2360997Z T=4096, 2025-05-07T20:33:10.2361198Z D=5120, 2025-05-07T20:33:10.2361397Z scale_ub=1200.0, 2025-05-07T20:33:10.2361639Z contiguous=False, 2025-05-07T20:33:10.2362130Z compiled=False, 2025-05-07T20:33:10.2362350Z ) 2025-05-07T20:33:10.2362668Z self = 2025-05-07T20:33:10.2363242Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.2363523Z 2025-05-07T20:33:10.2363600Z @given( 2025-05-07T20:33:10.2363833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2364139Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2364442Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2364835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2365152Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2365442Z ) 2025-05-07T20:33:10.2365791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2366227Z def test_silu_mul_quant( 2025-05-07T20:33:10.2366461Z self, 2025-05-07T20:33:10.2366663Z T: int, 2025-05-07T20:33:10.2366860Z D: int, 2025-05-07T20:33:10.2367075Z scale_ub: Optional[float], 2025-05-07T20:33:10.2367347Z contiguous: bool, 2025-05-07T20:33:10.2367654Z compiled: bool, 2025-05-07T20:33:10.2367882Z ) -> None: 2025-05-07T20:33:10.2368097Z torch.manual_seed(2025) 2025-05-07T20:33:10.2368337Z 2025-05-07T20:33:10.2368603Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2368938Z 2025-05-07T20:33:10.2369126Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2369410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2369720Z x = x_sign * x_clamp 2025-05-07T20:33:10.2369961Z x0 = x[:, :D] 2025-05-07T20:33:10.2370172Z x1 = x[:, D:] 2025-05-07T20:33:10.2370383Z 2025-05-07T20:33:10.2370569Z if contiguous: 2025-05-07T20:33:10.2370804Z x0 = x0.contiguous() 2025-05-07T20:33:10.2371059Z x1 = x1.contiguous() 2025-05-07T20:33:10.2371303Z 2025-05-07T20:33:10.2371494Z if scale_ub is not None: 2025-05-07T20:33:10.2371759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2372093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2372398Z ) 2025-05-07T20:33:10.2372590Z else: 2025-05-07T20:33:10.2372805Z scale_ub_tensor = None 2025-05-07T20:33:10.2373147Z 2025-05-07T20:33:10.2373378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2373690Z op = silu_mul_quant 2025-05-07T20:33:10.2373941Z if compiled: 2025-05-07T20:33:10.2374181Z op = torch.compile(op) 2025-05-07T20:33:10.2374477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2374750Z 2025-05-07T20:33:10.2374934Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2375099Z 2025-05-07T20:33:10.2375198Z moe/activation_test.py:117: 2025-05-07T20:33:10.2375493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2375824Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2376133Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2382084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.2382774Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2383302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.2383977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2384636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2385165Z kernel = self.compile( 2025-05-07T20:33:10.2385698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2386430Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2386860Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2387088Z 2025-05-07T20:33:10.2387291Z self = 2025-05-07T20:33:10.2388360Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2389801Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552392160>} 2025-05-07T20:33:10.2391108Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2392116Z context = 2025-05-07T20:33:10.2392398Z 2025-05-07T20:33:10.2392622Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2393138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2393600Z module_map=module_map) 2025-05-07T20:33:10.2393974Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2394324Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2394584Z E ^ 2025-05-07T20:33:10.2395049Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2395488Z 2025-05-07T20:33:10.2395893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2396394Z 2025-05-07T20:33:10.2396499Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.2396901Z self=, 2025-05-07T20:33:10.2397302Z T=4096, 2025-05-07T20:33:10.2397485Z D=5120, 2025-05-07T20:33:10.2397673Z scale_ub=1200.0, 2025-05-07T20:33:10.2397893Z contiguous=False, 2025-05-07T20:33:10.2398108Z compiled=True, 2025-05-07T20:33:10.2398305Z ) 2025-05-07T20:33:10.2398617Z self = 2025-05-07T20:33:10.2399097Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.2399395Z 2025-05-07T20:33:10.2399487Z @given( 2025-05-07T20:33:10.2399721Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.2400017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.2400316Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.2400638Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.2400958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.2401231Z ) 2025-05-07T20:33:10.2401581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.2402010Z def test_silu_mul_quant( 2025-05-07T20:33:10.2402247Z self, 2025-05-07T20:33:10.2402443Z T: int, 2025-05-07T20:33:10.2402642Z D: int, 2025-05-07T20:33:10.2402847Z scale_ub: Optional[float], 2025-05-07T20:33:10.2403114Z contiguous: bool, 2025-05-07T20:33:10.2403357Z compiled: bool, 2025-05-07T20:33:10.2403578Z ) -> None: 2025-05-07T20:33:10.2403791Z torch.manual_seed(2025) 2025-05-07T20:33:10.2404027Z 2025-05-07T20:33:10.2404288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.2404627Z 2025-05-07T20:33:10.2404816Z x_sign = torch.sign(x) 2025-05-07T20:33:10.2405108Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.2405459Z x = x_sign * x_clamp 2025-05-07T20:33:10.2405697Z x0 = x[:, :D] 2025-05-07T20:33:10.2405915Z x1 = x[:, D:] 2025-05-07T20:33:10.2406120Z 2025-05-07T20:33:10.2406347Z if contiguous: 2025-05-07T20:33:10.2406578Z x0 = x0.contiguous() 2025-05-07T20:33:10.2406827Z x1 = x1.contiguous() 2025-05-07T20:33:10.2407065Z 2025-05-07T20:33:10.2407257Z if scale_ub is not None: 2025-05-07T20:33:10.2407521Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.2407846Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.2408194Z ) 2025-05-07T20:33:10.2408381Z else: 2025-05-07T20:33:10.2408589Z scale_ub_tensor = None 2025-05-07T20:33:10.2408838Z 2025-05-07T20:33:10.2409061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.2409370Z op = silu_mul_quant 2025-05-07T20:33:10.2409616Z if compiled: 2025-05-07T20:33:10.2409856Z op = torch.compile(op) 2025-05-07T20:33:10.2410151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2410421Z 2025-05-07T20:33:10.2410659Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.2410819Z 2025-05-07T20:33:10.2410918Z moe/activation_test.py:117: 2025-05-07T20:33:10.2411209Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2411531Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.2411801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.2412349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.2412900Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.2413619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.2414290Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.2414809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.2415481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.2416134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.2416654Z kernel = self.compile( 2025-05-07T20:33:10.2417197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.2417834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.2418220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.2418445Z 2025-05-07T20:33:10.2418645Z self = 2025-05-07T20:33:10.2419748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.2421096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552393240>} 2025-05-07T20:33:10.2422400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.2423397Z context = 2025-05-07T20:33:10.2423683Z 2025-05-07T20:33:10.2423846Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.2424355Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.2424807Z module_map=module_map) 2025-05-07T20:33:10.2425216Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.2425564Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.2425821Z E ^ 2025-05-07T20:33:10.2426313Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.2426757Z 2025-05-07T20:33:10.2427164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.2427662Z 2025-05-07T20:33:10.3552691Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.3553264Z self=, 2025-05-07T20:33:10.3553665Z T=2048, 2025-05-07T20:33:10.3553861Z D=7168, 2025-05-07T20:33:10.3554057Z scale_ub=1200.0, 2025-05-07T20:33:10.3554278Z contiguous=False, 2025-05-07T20:33:10.3554512Z compiled=False, 2025-05-07T20:33:10.3554724Z ) 2025-05-07T20:33:10.3555058Z self = 2025-05-07T20:33:10.3555555Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.3555834Z 2025-05-07T20:33:10.3556005Z @given( 2025-05-07T20:33:10.3556235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.3556547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.3556849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.3557172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.3557494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.3557784Z ) 2025-05-07T20:33:10.3558137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.3558564Z def test_silu_mul_quant( 2025-05-07T20:33:10.3558806Z self, 2025-05-07T20:33:10.3559000Z T: int, 2025-05-07T20:33:10.3559399Z D: int, 2025-05-07T20:33:10.3559616Z scale_ub: Optional[float], 2025-05-07T20:33:10.3559889Z contiguous: bool, 2025-05-07T20:33:10.3560126Z compiled: bool, 2025-05-07T20:33:10.3560342Z ) -> None: 2025-05-07T20:33:10.3560560Z torch.manual_seed(2025) 2025-05-07T20:33:10.3560803Z 2025-05-07T20:33:10.3561073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.3561413Z 2025-05-07T20:33:10.3561611Z x_sign = torch.sign(x) 2025-05-07T20:33:10.3561895Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.3562201Z x = x_sign * x_clamp 2025-05-07T20:33:10.3562439Z x0 = x[:, :D] 2025-05-07T20:33:10.3562650Z x1 = x[:, D:] 2025-05-07T20:33:10.3562863Z 2025-05-07T20:33:10.3563048Z if contiguous: 2025-05-07T20:33:10.3563274Z x0 = x0.contiguous() 2025-05-07T20:33:10.3563532Z x1 = x1.contiguous() 2025-05-07T20:33:10.3563776Z 2025-05-07T20:33:10.3563964Z if scale_ub is not None: 2025-05-07T20:33:10.3564245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.3564574Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.3564884Z ) 2025-05-07T20:33:10.3565074Z else: 2025-05-07T20:33:10.3565289Z scale_ub_tensor = None 2025-05-07T20:33:10.3565539Z 2025-05-07T20:33:10.3565767Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.3566082Z op = silu_mul_quant 2025-05-07T20:33:10.3566331Z if compiled: 2025-05-07T20:33:10.3566573Z op = torch.compile(op) 2025-05-07T20:33:10.3566872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3567143Z 2025-05-07T20:33:10.3567340Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.3567501Z 2025-05-07T20:33:10.3567632Z moe/activation_test.py:117: 2025-05-07T20:33:10.3567926Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3568260Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.3568612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3569347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.3570031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.3570555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.3571224Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.3571877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.3572462Z kernel = self.compile( 2025-05-07T20:33:10.3573062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.3573706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.3574098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3574327Z 2025-05-07T20:33:10.3574535Z self = 2025-05-07T20:33:10.3575659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.3577007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea4220>} 2025-05-07T20:33:10.3578324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.3579380Z context = 2025-05-07T20:33:10.3579664Z 2025-05-07T20:33:10.3579826Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.3580342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.3580802Z module_map=module_map) 2025-05-07T20:33:10.3581165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.3581516Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.3581772Z E ^ 2025-05-07T20:33:10.3582226Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.3582672Z 2025-05-07T20:33:10.3583085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.3583590Z 2025-05-07T20:33:10.3583694Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.3584103Z self=, 2025-05-07T20:33:10.3584511Z T=1, 2025-05-07T20:33:10.3584696Z D=7168, 2025-05-07T20:33:10.3584898Z scale_ub=None, 2025-05-07T20:33:10.3585110Z contiguous=True, 2025-05-07T20:33:10.3585335Z compiled=False, 2025-05-07T20:33:10.3585542Z ) 2025-05-07T20:33:10.3585860Z self = 2025-05-07T20:33:10.3586335Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:10.3586596Z 2025-05-07T20:33:10.3586674Z @given( 2025-05-07T20:33:10.3586903Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.3587212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.3587512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.3587843Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.3588166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.3588449Z ) 2025-05-07T20:33:10.3588844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.3589314Z def test_silu_mul_quant( 2025-05-07T20:33:10.3589609Z self, 2025-05-07T20:33:10.3589812Z T: int, 2025-05-07T20:33:10.3590012Z D: int, 2025-05-07T20:33:10.3590226Z scale_ub: Optional[float], 2025-05-07T20:33:10.3590496Z contiguous: bool, 2025-05-07T20:33:10.3590739Z compiled: bool, 2025-05-07T20:33:10.3590951Z ) -> None: 2025-05-07T20:33:10.3591167Z torch.manual_seed(2025) 2025-05-07T20:33:10.3591407Z 2025-05-07T20:33:10.3591717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.3592050Z 2025-05-07T20:33:10.3592244Z x_sign = torch.sign(x) 2025-05-07T20:33:10.3592527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.3592833Z x = x_sign * x_clamp 2025-05-07T20:33:10.3593075Z x0 = x[:, :D] 2025-05-07T20:33:10.3593285Z x1 = x[:, D:] 2025-05-07T20:33:10.3593495Z 2025-05-07T20:33:10.3593680Z if contiguous: 2025-05-07T20:33:10.3593905Z x0 = x0.contiguous() 2025-05-07T20:33:10.3594207Z x1 = x1.contiguous() 2025-05-07T20:33:10.3594449Z 2025-05-07T20:33:10.3594637Z if scale_ub is not None: 2025-05-07T20:33:10.3594918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.3595246Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.3595549Z ) 2025-05-07T20:33:10.3595737Z else: 2025-05-07T20:33:10.3595949Z scale_ub_tensor = None 2025-05-07T20:33:10.3596201Z 2025-05-07T20:33:10.3596427Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.3596738Z op = silu_mul_quant 2025-05-07T20:33:10.3596987Z if compiled: 2025-05-07T20:33:10.3597232Z op = torch.compile(op) 2025-05-07T20:33:10.3597530Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3597805Z 2025-05-07T20:33:10.3597992Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.3598159Z 2025-05-07T20:33:10.3598254Z moe/activation_test.py:117: 2025-05-07T20:33:10.3598553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3598882Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.3599177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.3599888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.3600563Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.3601087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.3601757Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.3602410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.3602935Z kernel = self.compile( 2025-05-07T20:33:10.3603472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.3604123Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.3604512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.3604735Z 2025-05-07T20:33:10.3604943Z self = 2025-05-07T20:33:10.3605998Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.3607352Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea5120>} 2025-05-07T20:33:10.3608758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.3609822Z context = 2025-05-07T20:33:10.3610104Z 2025-05-07T20:33:10.3610271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.3610790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.3611294Z module_map=module_map) 2025-05-07T20:33:10.3611659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.3612007Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.3612263Z E ^ 2025-05-07T20:33:10.3612720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.3613209Z 2025-05-07T20:33:10.3613618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.3614127Z 2025-05-07T20:33:10.3614276Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.3614689Z self=, 2025-05-07T20:33:10.3615086Z T=16384, 2025-05-07T20:33:10.3615273Z D=7168, 2025-05-07T20:33:10.3615469Z scale_ub=1200.0, 2025-05-07T20:33:10.3615689Z contiguous=False, 2025-05-07T20:33:10.3615909Z compiled=True, 2025-05-07T20:33:10.5996778Z ) 2025-05-07T20:33:10.5997123Z self = 2025-05-07T20:33:10.5997663Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:10.5997957Z 2025-05-07T20:33:10.5998040Z @given( 2025-05-07T20:33:10.5998269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.5998588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.5998960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.5999633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6000285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6000853Z ) 2025-05-07T20:33:10.6001547Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6002431Z def test_silu_mul_quant( 2025-05-07T20:33:10.6002922Z self, 2025-05-07T20:33:10.6003318Z T: int, 2025-05-07T20:33:10.6003712Z D: int, 2025-05-07T20:33:10.6004159Z scale_ub: Optional[float], 2025-05-07T20:33:10.6004696Z contiguous: bool, 2025-05-07T20:33:10.6005173Z compiled: bool, 2025-05-07T20:33:10.6005630Z ) -> None: 2025-05-07T20:33:10.6006065Z torch.manual_seed(2025) 2025-05-07T20:33:10.6006545Z 2025-05-07T20:33:10.6007092Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6007780Z 2025-05-07T20:33:10.6008165Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6008750Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6009254Z x = x_sign * x_clamp 2025-05-07T20:33:10.6009495Z x0 = x[:, :D] 2025-05-07T20:33:10.6009715Z x1 = x[:, D:] 2025-05-07T20:33:10.6009930Z 2025-05-07T20:33:10.6010116Z if contiguous: 2025-05-07T20:33:10.6010353Z x0 = x0.contiguous() 2025-05-07T20:33:10.6010617Z x1 = x1.contiguous() 2025-05-07T20:33:10.6010855Z 2025-05-07T20:33:10.6011058Z if scale_ub is not None: 2025-05-07T20:33:10.6011339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6011676Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6011983Z ) 2025-05-07T20:33:10.6012187Z else: 2025-05-07T20:33:10.6012404Z scale_ub_tensor = None 2025-05-07T20:33:10.6012660Z 2025-05-07T20:33:10.6013093Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6013416Z op = silu_mul_quant 2025-05-07T20:33:10.6013732Z if compiled: 2025-05-07T20:33:10.6013990Z op = torch.compile(op) 2025-05-07T20:33:10.6014292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6014561Z 2025-05-07T20:33:10.6014756Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6014922Z 2025-05-07T20:33:10.6015051Z moe/activation_test.py:117: 2025-05-07T20:33:10.6015351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6015774Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6016061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6016618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.6017177Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.6017825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6018508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6019114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6019838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6020492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6021023Z kernel = self.compile( 2025-05-07T20:33:10.6021563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6022207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6022600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6022830Z 2025-05-07T20:33:10.6023037Z self = 2025-05-07T20:33:10.6024109Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6025457Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea6520>} 2025-05-07T20:33:10.6026779Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6027788Z context = 2025-05-07T20:33:10.6028071Z 2025-05-07T20:33:10.6028238Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6028750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6029236Z module_map=module_map) 2025-05-07T20:33:10.6029622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6029973Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6030226Z E ^ 2025-05-07T20:33:10.6030685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6031126Z 2025-05-07T20:33:10.6031542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6032045Z 2025-05-07T20:33:10.6032153Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.6032561Z self=, 2025-05-07T20:33:10.6032960Z T=1, 2025-05-07T20:33:10.6033144Z D=7168, 2025-05-07T20:33:10.6033384Z scale_ub=None, 2025-05-07T20:33:10.6033596Z contiguous=False, 2025-05-07T20:33:10.6033827Z compiled=False, 2025-05-07T20:33:10.6034028Z ) 2025-05-07T20:33:10.6034390Z self = 2025-05-07T20:33:10.6034875Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:10.6035132Z 2025-05-07T20:33:10.6035211Z @given( 2025-05-07T20:33:10.6035447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.6035754Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.6036100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.6036421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6036746Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6037026Z ) 2025-05-07T20:33:10.6037367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6037804Z def test_silu_mul_quant( 2025-05-07T20:33:10.6043227Z self, 2025-05-07T20:33:10.6043442Z T: int, 2025-05-07T20:33:10.6043645Z D: int, 2025-05-07T20:33:10.6043942Z scale_ub: Optional[float], 2025-05-07T20:33:10.6044218Z contiguous: bool, 2025-05-07T20:33:10.6044456Z compiled: bool, 2025-05-07T20:33:10.6044685Z ) -> None: 2025-05-07T20:33:10.6044902Z torch.manual_seed(2025) 2025-05-07T20:33:10.6045138Z 2025-05-07T20:33:10.6045406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6045750Z 2025-05-07T20:33:10.6045943Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6046228Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6046527Z x = x_sign * x_clamp 2025-05-07T20:33:10.6046756Z x0 = x[:, :D] 2025-05-07T20:33:10.6046971Z x1 = x[:, D:] 2025-05-07T20:33:10.6047178Z 2025-05-07T20:33:10.6047361Z if contiguous: 2025-05-07T20:33:10.6047591Z x0 = x0.contiguous() 2025-05-07T20:33:10.6047843Z x1 = x1.contiguous() 2025-05-07T20:33:10.6048076Z 2025-05-07T20:33:10.6048270Z if scale_ub is not None: 2025-05-07T20:33:10.6048543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6048874Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6049182Z ) 2025-05-07T20:33:10.6049372Z else: 2025-05-07T20:33:10.6049586Z scale_ub_tensor = None 2025-05-07T20:33:10.6049828Z 2025-05-07T20:33:10.6050061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6050374Z op = silu_mul_quant 2025-05-07T20:33:10.6050614Z if compiled: 2025-05-07T20:33:10.6050859Z op = torch.compile(op) 2025-05-07T20:33:10.6051154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6051420Z 2025-05-07T20:33:10.6051608Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6051766Z 2025-05-07T20:33:10.6051867Z moe/activation_test.py:117: 2025-05-07T20:33:10.6052156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6052484Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6052760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6053490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6054164Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6054689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6055361Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6056008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6056534Z kernel = self.compile( 2025-05-07T20:33:10.6057063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6057754Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6058181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6058413Z 2025-05-07T20:33:10.6058616Z self = 2025-05-07T20:33:10.6059981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6061477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552ea7100>} 2025-05-07T20:33:10.6062781Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6063853Z context = 2025-05-07T20:33:10.6064142Z 2025-05-07T20:33:10.6064305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6064816Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6065274Z module_map=module_map) 2025-05-07T20:33:10.6065636Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6065986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6066245Z E ^ 2025-05-07T20:33:10.6066699Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6067143Z 2025-05-07T20:33:10.6067552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6068054Z 2025-05-07T20:33:10.6068167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.6068581Z self=, 2025-05-07T20:33:10.6068974Z T=2048, 2025-05-07T20:33:10.6069166Z D=7168, 2025-05-07T20:33:10.6069355Z scale_ub=None, 2025-05-07T20:33:10.6069567Z contiguous=False, 2025-05-07T20:33:10.6069790Z compiled=True, 2025-05-07T20:33:10.6069993Z ) 2025-05-07T20:33:10.6924018Z self = 2025-05-07T20:33:10.6925056Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.6925598Z 2025-05-07T20:33:10.6925751Z @given( 2025-05-07T20:33:10.6926210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.6926817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.6927420Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.6928071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6928718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6929155Z ) 2025-05-07T20:33:10.6929511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6929952Z def test_silu_mul_quant( 2025-05-07T20:33:10.6930192Z self, 2025-05-07T20:33:10.6930395Z T: int, 2025-05-07T20:33:10.6930596Z D: int, 2025-05-07T20:33:10.6930806Z scale_ub: Optional[float], 2025-05-07T20:33:10.6931073Z contiguous: bool, 2025-05-07T20:33:10.6931323Z compiled: bool, 2025-05-07T20:33:10.6931539Z ) -> None: 2025-05-07T20:33:10.6931755Z torch.manual_seed(2025) 2025-05-07T20:33:10.6931999Z 2025-05-07T20:33:10.6932267Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6932603Z 2025-05-07T20:33:10.6932794Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6933165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6933581Z x = x_sign * x_clamp 2025-05-07T20:33:10.6933825Z x0 = x[:, :D] 2025-05-07T20:33:10.6934107Z x1 = x[:, D:] 2025-05-07T20:33:10.6934318Z 2025-05-07T20:33:10.6934508Z if contiguous: 2025-05-07T20:33:10.6934744Z x0 = x0.contiguous() 2025-05-07T20:33:10.6935004Z x1 = x1.contiguous() 2025-05-07T20:33:10.6935253Z 2025-05-07T20:33:10.6935450Z if scale_ub is not None: 2025-05-07T20:33:10.6935723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6936129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6936444Z ) 2025-05-07T20:33:10.6936643Z else: 2025-05-07T20:33:10.6936855Z scale_ub_tensor = None 2025-05-07T20:33:10.6937119Z 2025-05-07T20:33:10.6937352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6937668Z op = silu_mul_quant 2025-05-07T20:33:10.6937925Z if compiled: 2025-05-07T20:33:10.6938177Z op = torch.compile(op) 2025-05-07T20:33:10.6938482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6938831Z 2025-05-07T20:33:10.6939031Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6939212Z 2025-05-07T20:33:10.6939319Z moe/activation_test.py:117: 2025-05-07T20:33:10.6939639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6939966Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6940246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6940802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.6941361Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.6942008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6942687Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6943220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6943893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6944552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6945078Z kernel = self.compile( 2025-05-07T20:33:10.6945613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6946260Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6946650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6946881Z 2025-05-07T20:33:10.6947087Z self = 2025-05-07T20:33:10.6948152Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6949548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1edc720>} 2025-05-07T20:33:10.6950863Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6951878Z context = 2025-05-07T20:33:10.6952170Z 2025-05-07T20:33:10.6952337Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6952854Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6953377Z module_map=module_map) 2025-05-07T20:33:10.6953742Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6954160Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6954422Z E ^ 2025-05-07T20:33:10.6954884Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6955339Z 2025-05-07T20:33:10.6955752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6956258Z 2025-05-07T20:33:10.6956410Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.6956821Z self=, 2025-05-07T20:33:10.6957228Z T=4096, 2025-05-07T20:33:10.6957426Z D=7168, 2025-05-07T20:33:10.6957620Z scale_ub=None, 2025-05-07T20:33:10.6957831Z contiguous=False, 2025-05-07T20:33:10.6958059Z compiled=True, 2025-05-07T20:33:10.6958262Z ) 2025-05-07T20:33:10.6958584Z self = 2025-05-07T20:33:10.6959123Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:10.6959561Z 2025-05-07T20:33:10.6959646Z @given( 2025-05-07T20:33:10.6959875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.6960194Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.6960496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.6960827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.6961159Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.6961446Z ) 2025-05-07T20:33:10.6961791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.6962236Z def test_silu_mul_quant( 2025-05-07T20:33:10.6962484Z self, 2025-05-07T20:33:10.6962677Z T: int, 2025-05-07T20:33:10.6962875Z D: int, 2025-05-07T20:33:10.6963099Z scale_ub: Optional[float], 2025-05-07T20:33:10.6963368Z contiguous: bool, 2025-05-07T20:33:10.6963615Z compiled: bool, 2025-05-07T20:33:10.6963851Z ) -> None: 2025-05-07T20:33:10.6964060Z torch.manual_seed(2025) 2025-05-07T20:33:10.6964306Z 2025-05-07T20:33:10.6964584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.6964920Z 2025-05-07T20:33:10.6965115Z x_sign = torch.sign(x) 2025-05-07T20:33:10.6965404Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.6965715Z x = x_sign * x_clamp 2025-05-07T20:33:10.6965958Z x0 = x[:, :D] 2025-05-07T20:33:10.6966194Z x1 = x[:, D:] 2025-05-07T20:33:10.6966404Z 2025-05-07T20:33:10.6966590Z if contiguous: 2025-05-07T20:33:10.6966826Z x0 = x0.contiguous() 2025-05-07T20:33:10.6967087Z x1 = x1.contiguous() 2025-05-07T20:33:10.6967324Z 2025-05-07T20:33:10.6967521Z if scale_ub is not None: 2025-05-07T20:33:10.6967798Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.6968138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.6968449Z ) 2025-05-07T20:33:10.6968652Z else: 2025-05-07T20:33:10.6968866Z scale_ub_tensor = None 2025-05-07T20:33:10.6969125Z 2025-05-07T20:33:10.6969396Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.6969726Z op = silu_mul_quant 2025-05-07T20:33:10.6969982Z if compiled: 2025-05-07T20:33:10.6970245Z op = torch.compile(op) 2025-05-07T20:33:10.6970547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6970817Z 2025-05-07T20:33:10.6971013Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.6971177Z 2025-05-07T20:33:10.6971279Z moe/activation_test.py:117: 2025-05-07T20:33:10.6971570Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6971974Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.6972266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.6972872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.6973471Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.6974126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.6974813Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.6975344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.6976080Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.6976736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.6977259Z kernel = self.compile( 2025-05-07T20:33:10.6977799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.6978510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.6978907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.6979138Z 2025-05-07T20:33:10.6979363Z self = 2025-05-07T20:33:10.6980449Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.6981794Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1edd440>} 2025-05-07T20:33:10.6983113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.6984129Z context = 2025-05-07T20:33:10.6984412Z 2025-05-07T20:33:10.6984579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.6985101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.6985566Z module_map=module_map) 2025-05-07T20:33:10.6985929Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.6986286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.6986549Z E ^ 2025-05-07T20:33:10.6987010Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.6987452Z 2025-05-07T20:33:10.6987863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.6988370Z 2025-05-07T20:33:10.8552132Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.8552982Z self=, 2025-05-07T20:33:10.8553776Z T=16384, 2025-05-07T20:33:10.8554211Z D=5120, 2025-05-07T20:33:10.8554594Z scale_ub=1200.0, 2025-05-07T20:33:10.8555042Z contiguous=False, 2025-05-07T20:33:10.8555482Z compiled=False, 2025-05-07T20:33:10.8555885Z ) 2025-05-07T20:33:10.8556508Z self = 2025-05-07T20:33:10.8557493Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:10.8558044Z 2025-05-07T20:33:10.8558198Z @given( 2025-05-07T20:33:10.8558639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.8559182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.8559794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.8560121Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.8560522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.8560810Z ) 2025-05-07T20:33:10.8561160Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.8561596Z def test_silu_mul_quant( 2025-05-07T20:33:10.8561842Z self, 2025-05-07T20:33:10.8562039Z T: int, 2025-05-07T20:33:10.8562228Z D: int, 2025-05-07T20:33:10.8562449Z scale_ub: Optional[float], 2025-05-07T20:33:10.8562784Z contiguous: bool, 2025-05-07T20:33:10.8563020Z compiled: bool, 2025-05-07T20:33:10.8563247Z ) -> None: 2025-05-07T20:33:10.8563465Z torch.manual_seed(2025) 2025-05-07T20:33:10.8563701Z 2025-05-07T20:33:10.8563971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.8564310Z 2025-05-07T20:33:10.8564509Z x_sign = torch.sign(x) 2025-05-07T20:33:10.8564803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.8565116Z x = x_sign * x_clamp 2025-05-07T20:33:10.8565415Z x0 = x[:, :D] 2025-05-07T20:33:10.8565640Z x1 = x[:, D:] 2025-05-07T20:33:10.8565851Z 2025-05-07T20:33:10.8566040Z if contiguous: 2025-05-07T20:33:10.8566276Z x0 = x0.contiguous() 2025-05-07T20:33:10.8566535Z x1 = x1.contiguous() 2025-05-07T20:33:10.8566788Z 2025-05-07T20:33:10.8566975Z if scale_ub is not None: 2025-05-07T20:33:10.8567254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.8567600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.8567936Z ) 2025-05-07T20:33:10.8568135Z else: 2025-05-07T20:33:10.8568357Z scale_ub_tensor = None 2025-05-07T20:33:10.8568622Z 2025-05-07T20:33:10.8568864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.8569214Z op = silu_mul_quant 2025-05-07T20:33:10.8569527Z if compiled: 2025-05-07T20:33:10.8569792Z op = torch.compile(op) 2025-05-07T20:33:10.8570118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8570415Z 2025-05-07T20:33:10.8570613Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.8570794Z 2025-05-07T20:33:10.8570899Z moe/activation_test.py:117: 2025-05-07T20:33:10.8571227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8571593Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.8571910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8572720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:10.8573514Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.8574045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.8574730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.8575394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.8575915Z kernel = self.compile( 2025-05-07T20:33:10.8576452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.8577099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.8577497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8577731Z 2025-05-07T20:33:10.8577936Z self = 2025-05-07T20:33:10.8578997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.8580535Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1ede340>} 2025-05-07T20:33:10.8581856Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.8582861Z context = 2025-05-07T20:33:10.8583191Z 2025-05-07T20:33:10.8583359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.8583876Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.8584339Z module_map=module_map) 2025-05-07T20:33:10.8584702Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.8585062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.8585327Z E ^ 2025-05-07T20:33:10.8585826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.8586276Z 2025-05-07T20:33:10.8586690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.8587196Z 2025-05-07T20:33:10.8587303Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.8587719Z self=, 2025-05-07T20:33:10.8588117Z T=16384, 2025-05-07T20:33:10.8588315Z D=5120, 2025-05-07T20:33:10.8588509Z scale_ub=1200.0, 2025-05-07T20:33:10.8588734Z contiguous=True, 2025-05-07T20:33:10.8588958Z compiled=True, 2025-05-07T20:33:10.8589168Z ) 2025-05-07T20:33:10.8589484Z self = 2025-05-07T20:33:10.8589980Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:10.8590254Z 2025-05-07T20:33:10.8590342Z @given( 2025-05-07T20:33:10.8590585Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.8590896Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.8591207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.8591536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.8591862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.8592149Z ) 2025-05-07T20:33:10.8592505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.8592940Z def test_silu_mul_quant( 2025-05-07T20:33:10.8593186Z self, 2025-05-07T20:33:10.8593386Z T: int, 2025-05-07T20:33:10.8593586Z D: int, 2025-05-07T20:33:10.8593804Z scale_ub: Optional[float], 2025-05-07T20:33:10.8594074Z contiguous: bool, 2025-05-07T20:33:10.8594315Z compiled: bool, 2025-05-07T20:33:10.8594540Z ) -> None: 2025-05-07T20:33:10.8594760Z torch.manual_seed(2025) 2025-05-07T20:33:10.8595007Z 2025-05-07T20:33:10.8595280Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.8595616Z 2025-05-07T20:33:10.8595811Z x_sign = torch.sign(x) 2025-05-07T20:33:10.8596096Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.8596406Z x = x_sign * x_clamp 2025-05-07T20:33:10.8596652Z x0 = x[:, :D] 2025-05-07T20:33:10.8596868Z x1 = x[:, D:] 2025-05-07T20:33:10.8597080Z 2025-05-07T20:33:10.8597274Z if contiguous: 2025-05-07T20:33:10.8597502Z x0 = x0.contiguous() 2025-05-07T20:33:10.8597765Z x1 = x1.contiguous() 2025-05-07T20:33:10.8598011Z 2025-05-07T20:33:10.8598201Z if scale_ub is not None: 2025-05-07T20:33:10.8598476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.8598880Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.8599183Z ) 2025-05-07T20:33:10.8599404Z else: 2025-05-07T20:33:10.8599684Z scale_ub_tensor = None 2025-05-07T20:33:10.8599931Z 2025-05-07T20:33:10.8600160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.8600476Z op = silu_mul_quant 2025-05-07T20:33:10.8600723Z if compiled: 2025-05-07T20:33:10.8600971Z op = torch.compile(op) 2025-05-07T20:33:10.8601267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8601583Z 2025-05-07T20:33:10.8601775Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.8601943Z 2025-05-07T20:33:10.8602044Z moe/activation_test.py:117: 2025-05-07T20:33:10.8602340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8602669Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.8602953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.8603509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:10.8604104Z return fn(*args, **kwargs) 
2025-05-07T20:33:10.8604752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.8605431Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.8605959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.8606627Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.8607282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.8607803Z kernel = self.compile( 2025-05-07T20:33:10.8608335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.8608974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.8609372Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.8609613Z 2025-05-07T20:33:10.8609858Z self = 2025-05-07T20:33:10.8616145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.8617510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1edf9c0>} 2025-05-07T20:33:10.8618829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.8619835Z context = 2025-05-07T20:33:10.8620120Z 2025-05-07T20:33:10.8620294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.8620807Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.8621270Z module_map=module_map) 2025-05-07T20:33:10.8621627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.8621976Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.8622234Z E ^ 2025-05-07T20:33:10.8622689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.8623135Z 2025-05-07T20:33:10.8623547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.8624121Z 2025-05-07T20:33:11.0304254Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0304845Z self=, 2025-05-07T20:33:11.0305255Z T=16384, 2025-05-07T20:33:11.0305440Z D=5120, 2025-05-07T20:33:11.0305632Z scale_ub=None, 2025-05-07T20:33:11.0305846Z contiguous=False, 2025-05-07T20:33:11.0306071Z compiled=True, 2025-05-07T20:33:11.0306272Z ) 2025-05-07T20:33:11.0306581Z self = 2025-05-07T20:33:11.0307075Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.0307434Z 2025-05-07T20:33:11.0307516Z @given( 2025-05-07T20:33:11.0307746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0308050Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0308346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0308674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0309005Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0309277Z ) 2025-05-07T20:33:11.0309694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0310131Z def test_silu_mul_quant( 2025-05-07T20:33:11.0310366Z self, 2025-05-07T20:33:11.0310559Z T: int, 2025-05-07T20:33:11.0310754Z D: int, 2025-05-07T20:33:11.0310968Z scale_ub: Optional[float], 2025-05-07T20:33:11.0311230Z contiguous: bool, 2025-05-07T20:33:11.0311462Z compiled: bool, 2025-05-07T20:33:11.0311681Z ) -> None: 2025-05-07T20:33:11.0311893Z torch.manual_seed(2025) 2025-05-07T20:33:11.0312128Z 2025-05-07T20:33:11.0312403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0312739Z 2025-05-07T20:33:11.0312933Z x_sign = torch.sign(x) 2025-05-07T20:33:11.0313223Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.0313529Z x = x_sign * x_clamp 2025-05-07T20:33:11.0313770Z x0 = x[:, :D] 2025-05-07T20:33:11.0313982Z x1 = x[:, D:] 2025-05-07T20:33:11.0314185Z 2025-05-07T20:33:11.0314373Z if contiguous: 2025-05-07T20:33:11.0314600Z x0 = x0.contiguous() 2025-05-07T20:33:11.0314854Z x1 = x1.contiguous() 2025-05-07T20:33:11.0315094Z 2025-05-07T20:33:11.0315288Z if scale_ub is not None: 2025-05-07T20:33:11.0315556Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.0315887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.0316195Z ) 2025-05-07T20:33:11.0316387Z else: 2025-05-07T20:33:11.0316594Z scale_ub_tensor = None 2025-05-07T20:33:11.0316844Z 2025-05-07T20:33:11.0317077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.0317383Z op = silu_mul_quant 2025-05-07T20:33:11.0317629Z if compiled: 2025-05-07T20:33:11.0317880Z op = torch.compile(op) 2025-05-07T20:33:11.0318164Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.0318437Z 2025-05-07T20:33:11.0318631Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.0318792Z 2025-05-07T20:33:11.0318893Z moe/activation_test.py:117: 2025-05-07T20:33:11.0319186Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.0319563Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.0319846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.0320393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.0320948Z return fn(*args, **kwargs) 
The next ten generated examples fail with this identical traceback; only the drawn parameters differ:

2025-05-07T20:33:11.0335975Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError (type fp8e4nv not supported in this architecture)
2025-05-07T20:33:11.1265079Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.3008760Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:33:11.3040522Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.5749131Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:33:11.6987759Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:33:11.7019341Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:33:11.8765340Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.8805285Z Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:33:11.9729002Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
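Each "Trying example:" entry above is one parameter combination drawn by Hypothesis from the @given strategies shown earlier. To replay one combination deterministically while debugging, rather than waiting for the sampler to draw it again, the failing arguments can be pinned with Hypothesis's @example decorator. A minimal sketch mirroring the strategies above; the standalone function form and elided body are simplifications, not the test file's actual code:

    from typing import Optional

    from hypothesis import example, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    # Pin the first failing combination from the log so it always runs.
    @example(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
    @settings(deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body as in the excerpt above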
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
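Every CompilationError in this run has the same root cause: the Triton kernel behind silu_mul_quant emits fp8e4nv (float8 E4M3) values, and the error text says that dtype is unavailable on this architecture, with only fp8e4b15 and fp8e5 supported. The linux.g5.4xlarge runner carries an A10G, which reports compute capability (8, 6), below the (8, 9) Ada / (9, 0) Hopper floor that fp8e4nv kernels generally require in this Triton build. A minimal capability gate, as a sketch only (supports_fp8e4nv and Fp8ActivationTests are illustrative names, not FBGEMM's actual gating):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv is Triton's name for float8 E4M3; the build in this job
        # accepts it only on GPUs with compute capability (8, 9) or newer.
        # The A10G behind linux.g5.4xlarge reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With a gate like this, the examples below would be reported as skips instead of repeated hard failures.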
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

[remaining Hypothesis examples condensed; the test source listing and the Triton traceback are identical to those shown above]

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 112.00 MiB; 28.44 MiB free, 21.61 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 140.44 MiB free, 21.50 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:95 (torch.clamp): tried to allocate 56.00 MiB; 28.44 MiB free, 21.67 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB; 28.44 MiB free, 21.67 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
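The full OutOfMemoryError message quoted above ends with the allocator's own hint: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True when a large amount of memory is reserved but unallocated. The variable is read when the CUDA caching allocator initializes, so it has to be in the environment before the process first touches CUDA; a sketch of the in-process form (in CI it would normally be exported at the job level instead):

    import os

    # Must be set before the first CUDA allocation in this process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after the variable on purpose)

    x = torch.randn(16384, 2 * 5120, device="cuda", dtype=torch.bfloat16)

This only mitigates fragmentation, though; it does nothing about the 21.5+ GiB that remains allocated from one example to the next.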
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 56.00 MiB; 26.44 MiB free, 21.69 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E triton.compiler.errors.CompilationError in _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 40.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 320.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 80.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch
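One trend in these entries is worth calling out: the amount already allocated by PyTorch climbs from 21.50 GiB to 21.73 GiB and never drops, so even 40.00 MiB requests fail. Hypothesis drives every example through a single invocation of the test method, so unittest setUp/tearDown does not run between examples, and the failure tracebacks it retains can keep frame locals (including the large CUDA tensors) alive. A hedged sketch of an explicit cleanup helper (release_cuda_memory is an illustrative name, not part of this test suite) that the test body could call before allocating:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Collect first so tensors pinned only by dead frames and stored
        # tracebacks become garbage, then return cached blocks to the GPU.
        gc.collect()
        torch.cuda.empty_cache()

torch.cuda.empty_cache() only releases cached blocks whose tensors are already unreferenced, which is why the gc.collect() pass comes first.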
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 40.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 112.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB; 26.44 MiB free, 21.73 GiB already allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.5194600Z 2025-05-07T20:33:12.5194723Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:12.5194929Z 2025-05-07T20:33:12.5195034Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.5195437Z self=, 2025-05-07T20:33:12.5195832Z T=128, 2025-05-07T20:33:12.5196020Z D=5120, 2025-05-07T20:33:12.5196206Z scale_ub=1200.0, 2025-05-07T20:33:12.5196430Z contiguous=False, 2025-05-07T20:33:12.5196655Z compiled=False, 2025-05-07T20:33:12.5196855Z ) 2025-05-07T20:33:12.6480546Z self = 2025-05-07T20:33:12.6481788Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:12.6482299Z 2025-05-07T20:33:12.6482441Z @given( 2025-05-07T20:33:12.6482858Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6483435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6483990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6484606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6485214Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6485740Z ) 2025-05-07T20:33:12.6486380Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6487186Z def test_silu_mul_quant( 2025-05-07T20:33:12.6487633Z self, 2025-05-07T20:33:12.6487992Z T: int, 2025-05-07T20:33:12.6488342Z D: int, 2025-05-07T20:33:12.6488757Z scale_ub: Optional[float], 2025-05-07T20:33:12.6489268Z contiguous: bool, 2025-05-07T20:33:12.6489697Z compiled: bool, 2025-05-07T20:33:12.6490115Z ) -> None: 2025-05-07T20:33:12.6490494Z torch.manual_seed(2025) 2025-05-07T20:33:12.6490764Z 2025-05-07T20:33:12.6491035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6491380Z 2025-05-07T20:33:12.6491577Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6491867Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6499753Z x = x_sign * x_clamp 2025-05-07T20:33:12.6500065Z x0 = x[:, :D] 2025-05-07T20:33:12.6500299Z x1 = x[:, D:] 2025-05-07T20:33:12.6500518Z 2025-05-07T20:33:12.6500707Z if contiguous: 2025-05-07T20:33:12.6500949Z x0 = x0.contiguous() 2025-05-07T20:33:12.6501218Z x1 = x1.contiguous() 2025-05-07T20:33:12.6501464Z 2025-05-07T20:33:12.6501667Z if scale_ub is not None: 2025-05-07T20:33:12.6501952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.6502410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.6502719Z ) 2025-05-07T20:33:12.6502924Z else: 2025-05-07T20:33:12.6503140Z scale_ub_tensor = None 2025-05-07T20:33:12.6503394Z 2025-05-07T20:33:12.6503638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.6503962Z op = silu_mul_quant 2025-05-07T20:33:12.6504211Z if compiled: 2025-05-07T20:33:12.6504470Z op = torch.compile(op) 2025-05-07T20:33:12.6504760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6505024Z 2025-05-07T20:33:12.6505212Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.6505372Z 2025-05-07T20:33:12.6505471Z moe/activation_test.py:117: 2025-05-07T20:33:12.6505759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6506085Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.6506475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6507163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.6507855Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.6508396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.6509134Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.6509808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.6510367Z kernel = self.compile( 2025-05-07T20:33:12.6510909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.6511554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.6511986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6512221Z 2025-05-07T20:33:12.6512427Z self = 2025-05-07T20:33:12.6513493Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.6514859Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a14107c0>} 2025-05-07T20:33:12.6516183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.6517189Z context = 2025-05-07T20:33:12.6517483Z 2025-05-07T20:33:12.6517652Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.6518167Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.6518636Z module_map=module_map) 2025-05-07T20:33:12.6519002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.6519358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.6519617Z E ^ 2025-05-07T20:33:12.6520074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.6520573Z 2025-05-07T20:33:12.6520984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.6521506Z 2025-05-07T20:33:12.6521610Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6522031Z self=, 2025-05-07T20:33:12.6522472Z T=2048, 2025-05-07T20:33:12.6522669Z D=7168, 2025-05-07T20:33:12.6522865Z scale_ub=None, 2025-05-07T20:33:12.6523072Z contiguous=False, 2025-05-07T20:33:12.6523298Z compiled=False, 2025-05-07T20:33:12.6523503Z ) 2025-05-07T20:33:12.6523813Z self = 2025-05-07T20:33:12.6524303Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:12.6524576Z 2025-05-07T20:33:12.6524662Z @given( 2025-05-07T20:33:12.6524885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6525199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6525505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6525841Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6526164Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6526445Z ) 2025-05-07T20:33:12.6526837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6527271Z def test_silu_mul_quant( 2025-05-07T20:33:12.6527516Z self, 2025-05-07T20:33:12.6527716Z T: int, 2025-05-07T20:33:12.6527908Z D: int, 2025-05-07T20:33:12.6528133Z scale_ub: Optional[float], 2025-05-07T20:33:12.6528406Z contiguous: bool, 2025-05-07T20:33:12.6528689Z compiled: bool, 2025-05-07T20:33:12.6528915Z ) -> None: 2025-05-07T20:33:12.6529135Z torch.manual_seed(2025) 2025-05-07T20:33:12.6529370Z 2025-05-07T20:33:12.6529640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6531701Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
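The CompilationError above is an architecture gate rather than a bug in the kernel launch: Triton's fp8e4nv type (float8 e4m3) requires SM 8.9+ (Ada/Hopper), while the A10G on a g5 runner reports SM 8.6, so only fp8e4b15 and fp8e5 are lowered there. A hedged sketch of a skip guard (hypothetical helper, not in activation_test.py):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton rejects fp8e4nv conversions below compute capability 8.9
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class Fp8ActivationTests(unittest.TestCase):
        pass  # fp8 test bodies would go here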
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.6533592Z 2025-05-07T20:33:12.6533712Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:12.6533923Z 2025-05-07T20:33:12.6534039Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6534441Z self=, 2025-05-07T20:33:12.6534830Z T=128, 2025-05-07T20:33:12.6535012Z D=7168, 2025-05-07T20:33:12.6535200Z scale_ub=1200.0, 2025-05-07T20:33:12.6535424Z contiguous=True, 2025-05-07T20:33:12.6535649Z compiled=True, 2025-05-07T20:33:12.6535852Z ) 2025-05-07T20:33:12.6835868Z self = 2025-05-07T20:33:12.6836623Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:12.6837001Z 2025-05-07T20:33:12.6837109Z @given( 2025-05-07T20:33:12.6837418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6837829Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6838133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6838468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6838801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6839079Z ) 2025-05-07T20:33:12.6839429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6839911Z def test_silu_mul_quant( 2025-05-07T20:33:12.6840174Z self, 2025-05-07T20:33:12.6840375Z T: int, 2025-05-07T20:33:12.6840578Z D: int, 2025-05-07T20:33:12.6840797Z scale_ub: Optional[float], 2025-05-07T20:33:12.6841070Z contiguous: bool, 2025-05-07T20:33:12.6841315Z compiled: bool, 2025-05-07T20:33:12.6841661Z ) -> None: 2025-05-07T20:33:12.6841876Z torch.manual_seed(2025) 2025-05-07T20:33:12.6842122Z 2025-05-07T20:33:12.6842400Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6842738Z 2025-05-07T20:33:12.6842935Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6843225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6843531Z x = x_sign * x_clamp 2025-05-07T20:33:12.6843778Z x0 = x[:, :D] 2025-05-07T20:33:12.6844004Z x1 = x[:, D:] 2025-05-07T20:33:12.6844206Z 2025-05-07T20:33:12.6844397Z if contiguous: 2025-05-07T20:33:12.6844639Z x0 = x0.contiguous() 2025-05-07T20:33:12.6844894Z x1 = x1.contiguous() 2025-05-07T20:33:12.6845141Z 2025-05-07T20:33:12.6845336Z if scale_ub is not None: 2025-05-07T20:33:12.6845604Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.6846010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.6846328Z ) 2025-05-07T20:33:12.6846527Z else: 2025-05-07T20:33:12.6846736Z scale_ub_tensor = None 2025-05-07T20:33:12.6846990Z 2025-05-07T20:33:12.6847226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.6847542Z op = silu_mul_quant 2025-05-07T20:33:12.6847794Z if compiled: 2025-05-07T20:33:12.6848048Z op = torch.compile(op) 2025-05-07T20:33:12.6848414Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6848690Z 2025-05-07T20:33:12.6848888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.6849049Z 2025-05-07T20:33:12.6849148Z moe/activation_test.py:117: 2025-05-07T20:33:12.6849445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6849769Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.6850066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.6850730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:12.6851285Z return fn(*args, **kwargs) 
2025-05-07T20:33:12.6851934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.6852611Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.6853223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.6853901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.6854553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.6855079Z kernel = self.compile( 2025-05-07T20:33:12.6855612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.6856363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.6856818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.6857080Z 2025-05-07T20:33:12.6857320Z self = 2025-05-07T20:33:12.6858626Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.6860330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1411940>} 2025-05-07T20:33:12.6861648Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.6862666Z context = 2025-05-07T20:33:12.6863019Z 2025-05-07T20:33:12.6863189Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.6863696Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.6864158Z module_map=module_map) 2025-05-07T20:33:12.6864523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.6864870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.6865125Z E ^ 2025-05-07T20:33:12.6865584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.6866024Z 2025-05-07T20:33:12.6866442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.6866945Z 2025-05-07T20:33:12.6867118Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6867538Z self=, 2025-05-07T20:33:12.6867939Z T=128, 2025-05-07T20:33:12.6868121Z D=7168, 2025-05-07T20:33:12.6868316Z scale_ub=1200.0, 2025-05-07T20:33:12.6868544Z contiguous=True, 2025-05-07T20:33:12.6868764Z compiled=False, 2025-05-07T20:33:12.6868973Z ) 2025-05-07T20:33:12.6869292Z self = 2025-05-07T20:33:12.6869870Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:12.6870165Z 2025-05-07T20:33:12.6870244Z @given( 2025-05-07T20:33:12.6870480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6870788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6871093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6871420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6871753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6872100Z ) 2025-05-07T20:33:12.6872450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6872963Z def test_silu_mul_quant( 2025-05-07T20:33:12.6873215Z self, 2025-05-07T20:33:12.6873421Z T: int, 2025-05-07T20:33:12.6873633Z D: int, 2025-05-07T20:33:12.6873862Z scale_ub: Optional[float], 2025-05-07T20:33:12.6874161Z contiguous: bool, 2025-05-07T20:33:12.6874416Z compiled: bool, 2025-05-07T20:33:12.6874647Z ) -> None: 2025-05-07T20:33:12.6874874Z torch.manual_seed(2025) 2025-05-07T20:33:12.6875137Z 2025-05-07T20:33:12.6875424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6875808Z 2025-05-07T20:33:12.6876008Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6876322Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6878818Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
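By this point even a 20.00 MiB request fails with only 4.44 MiB free, which means memory from earlier examples is still held: Hypothesis keeps failing frames alive, and the caching allocator keeps freed blocks reserved. One mitigation, as a sketch (hypothetical helper, not in the test file), is to release the cache between examples, e.g. from tearDown:

    import gc

    import torch

    def _free_cuda() -> None:
        gc.collect()              # drop unreachable Python references first
        torch.cuda.synchronize()  # let pending kernels finish
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver

Note this only returns cached blocks; tensors still referenced by live tracebacks stay allocated.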
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.6881215Z 2025-05-07T20:33:12.6881345Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:12.6881590Z 2025-05-07T20:33:12.6881698Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6882168Z self=, 2025-05-07T20:33:12.6882621Z T=128, 2025-05-07T20:33:12.6882819Z D=5120, 2025-05-07T20:33:12.6883019Z scale_ub=1200.0, 2025-05-07T20:33:12.6883259Z contiguous=True, 2025-05-07T20:33:12.6883563Z compiled=True, 2025-05-07T20:33:12.6883773Z ) 2025-05-07T20:33:12.6884094Z self = 2025-05-07T20:33:12.6884571Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:12.6884837Z 2025-05-07T20:33:12.6884915Z @given( 2025-05-07T20:33:12.6885145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.6885453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.6885753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.6886079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.6886402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.6886689Z ) 2025-05-07T20:33:12.6887039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.6887478Z def test_silu_mul_quant( 2025-05-07T20:33:12.6887717Z self, 2025-05-07T20:33:12.6887964Z T: int, 2025-05-07T20:33:12.6888171Z D: int, 2025-05-07T20:33:12.6888386Z scale_ub: Optional[float], 2025-05-07T20:33:12.6888661Z contiguous: bool, 2025-05-07T20:33:12.6888902Z compiled: bool, 2025-05-07T20:33:12.6889118Z ) -> None: 2025-05-07T20:33:12.6889346Z torch.manual_seed(2025) 2025-05-07T20:33:12.6889589Z 2025-05-07T20:33:12.6889854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.6890233Z 2025-05-07T20:33:12.6890426Z x_sign = torch.sign(x) 2025-05-07T20:33:12.6890713Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.6892706Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.6894575Z 2025-05-07T20:33:12.6894693Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:12.6894906Z 2025-05-07T20:33:12.6895008Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.6895418Z self=, 2025-05-07T20:33:12.6895810Z T=128, 2025-05-07T20:33:12.6896002Z D=7168, 2025-05-07T20:33:12.6896191Z scale_ub=None, 2025-05-07T20:33:12.6896400Z contiguous=True, 2025-05-07T20:33:12.6896620Z compiled=True, 2025-05-07T20:33:12.6896819Z ) 2025-05-07T20:33:12.9379303Z self = 2025-05-07T20:33:12.9379959Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9380321Z 2025-05-07T20:33:12.9380420Z @given( 2025-05-07T20:33:12.9380660Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9380974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9381281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9381616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9381946Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9382232Z ) 2025-05-07T20:33:12.9382582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9383020Z def test_silu_mul_quant( 2025-05-07T20:33:12.9383261Z self, 2025-05-07T20:33:12.9383463Z T: int, 2025-05-07T20:33:12.9383668Z D: int, 2025-05-07T20:33:12.9383891Z scale_ub: Optional[float], 2025-05-07T20:33:12.9384164Z contiguous: bool, 2025-05-07T20:33:12.9384409Z compiled: bool, 2025-05-07T20:33:12.9384632Z ) -> None: 2025-05-07T20:33:12.9384859Z torch.manual_seed(2025) 2025-05-07T20:33:12.9385224Z 2025-05-07T20:33:12.9385501Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9387510Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
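For orientation on the reference path that fails further below (ref_fn calling triton_quantize_fp8_row, which trips the same fp8e4nv error): row-wise fp8 quantization scales each row so its max fits the fp8 range, then casts. A rough pure-PyTorch sketch of that idea, assuming e4m3 (max 448.0) and ignoring the scale_ub clamp; the actual FBGEMM kernel lives in triton_gemm/fp8_gemm.py and may differ in detail:

    import torch

    def quantize_fp8_row_ref(y: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1).clamp(min=1e-12)
        scale = row_max / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing as the test does (y_fp8.to(torch.float32) * y_scale[:, None]) then recovers y up to fp8 rounding.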
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9389333Z 2025-05-07T20:33:12.9389454Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:12.9389668Z 2025-05-07T20:33:12.9405485Z FAILED 2025-05-07T20:33:12.9405754Z 2025-05-07T20:33:12.9406137Z =================================== FAILURES =================================== 2025-05-07T20:33:12.9406779Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:12.9407376Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:12.9408203Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:12.9408940Z | yield 2025-05-07T20:33:12.9409625Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:12.9410339Z | self._callTestMethod(testMethod) 2025-05-07T20:33:12.9411115Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:12.9411866Z | if method() is not None: 2025-05-07T20:33:12.9412198Z | ^^^^^^^^ 2025-05-07T20:33:12.9413242Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:12.9414234Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9414638Z | ^^^^^^^ 2025-05-07T20:33:12.9415386Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:12.9416436Z | raise the_error_hypothesis_found 2025-05-07T20:33:12.9417016Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:12.9417601Z +-+---------------- 1 ---------------- 2025-05-07T20:33:12.9418003Z | Traceback (most recent call last): 2025-05-07T20:33:12.9418965Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:12.9420010Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9420522Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9423228Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
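As the traceback above shows, Hypothesis reports the distinct failures as a single ExceptionGroup ("Hypothesis found 4 distinct failures"). On Python 3.11+ such a group can be split by type when triaging; a runnable sketch where run_suite is a hypothetical stand-in for whatever invokes the test:

    import torch

    def run_suite() -> None:
        # Stand-in that raises a group shaped like the one above.
        raise ExceptionGroup(
            "Hypothesis found 4 distinct failures",
            [torch.OutOfMemoryError("CUDA out of memory"),
             RuntimeError("CompilationError stand-in")],
        )

    try:
        run_suite()
    except ExceptionGroup as eg:
        oom, rest = eg.split(torch.OutOfMemoryError)
        # `oom` groups the CUDA OOM failures, `rest` everything else
        print(len(oom.exceptions), len(rest.exceptions))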
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9425906Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9426502Z | self=, 2025-05-07T20:33:12.9427063Z | T=2048, 2025-05-07T20:33:12.9427378Z | D=5120, # or any other generated value 2025-05-07T20:33:12.9427845Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:12.9428513Z | contiguous=True, # or any other generated value 2025-05-07T20:33:12.9429007Z | compiled=False, # or any other generated value 2025-05-07T20:33:12.9429408Z | ) 2025-05-07T20:33:12.9429693Z | 2025-05-07T20:33:12.9430398Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:12.9431217Z +---------------- 2 ---------------- 2025-05-07T20:33:12.9431608Z | Traceback (most recent call last): 2025-05-07T20:33:12.9432605Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:12.9433652Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9434142Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9436904Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9439604Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9440241Z | self=, 2025-05-07T20:33:12.9440827Z | T=128, 2025-05-07T20:33:12.9441095Z | D=7168, 2025-05-07T20:33:12.9441375Z | scale_ub=None, 2025-05-07T20:33:12.9441703Z | contiguous=True, 2025-05-07T20:33:12.9442028Z | compiled=True, 2025-05-07T20:33:12.9442342Z | ) 2025-05-07T20:33:12.9442646Z | 2025-05-07T20:33:12.9443353Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:12.9444164Z +---------------- 3 ---------------- 2025-05-07T20:33:12.9444558Z | Traceback (most recent call last): 2025-05-07T20:33:12.9445497Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:12.9446554Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9447087Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9449412Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
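Each sub-failure comes with a @reproduce_failure blob, as printed above. A sketch of pinning the first one down locally, as a temporary decorator on the real test (Hypothesis refuses to run it if the version does not match, and it should be removed after debugging):

    from hypothesis import given, reproduce_failure, settings, strategies as st

    @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged body from moe/activation_test.py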
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:12.9451358Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9451791Z | self=, 2025-05-07T20:33:12.9452198Z | T=128, 2025-05-07T20:33:12.9452398Z | D=5120, 2025-05-07T20:33:12.9470653Z | scale_ub=1200.0, 2025-05-07T20:33:12.9471003Z | contiguous=True, 2025-05-07T20:33:12.9471332Z | compiled=True, 2025-05-07T20:33:12.9471635Z | ) 2025-05-07T20:33:12.9471882Z | 2025-05-07T20:33:12.9472627Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:12.9473686Z +---------------- 4 ---------------- 2025-05-07T20:33:12.9474088Z | Traceback (most recent call last): 2025-05-07T20:33:12.9475064Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:12.9476011Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9476388Z | ^^^^^^^^ 2025-05-07T20:33:12.9477247Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:12.9478204Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9478656Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9479868Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:12.9480957Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9481785Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:12.9482772Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9483365Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9484357Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:12.9485405Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9486042Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9487002Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:12.9487963Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9488469Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9489280Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:12.9490063Z | fn() 2025-05-07T20:33:12.9490845Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:12.9491689Z | self.fn.run( 2025-05-07T20:33:12.9492401Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:12.9493287Z | kernel = self.compile( 2025-05-07T20:33:12.9493651Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:12.9494451Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:12.9495409Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9495931Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9496781Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:12.9497860Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9498504Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:12.9499027Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9499503Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9499869Z | ^ 2025-05-07T20:33:12.9500554Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9501377Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:12.9501911Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:12.9502611Z | self=, 2025-05-07T20:33:12.9503199Z | T=1, # or any other generated value 2025-05-07T20:33:12.9503633Z | D=5120, # or any other generated value 2025-05-07T20:33:12.9504112Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:12.9504605Z | contiguous=True, # or any other generated value 2025-05-07T20:33:12.9505096Z | compiled=True, # or any other generated value 2025-05-07T20:33:12.9505508Z | ) 2025-05-07T20:33:12.9505759Z | 2025-05-07T20:33:12.9506489Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:12.9507372Z +------------------------------------ 2025-05-07T20:33:12.9507877Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:12.9508397Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9508963Z self=, 2025-05-07T20:33:12.9509486Z T=1, 2025-05-07T20:33:12.9509732Z D=5120, 2025-05-07T20:33:12.9509994Z scale_ub=None, 2025-05-07T20:33:12.9510377Z contiguous=True, 2025-05-07T20:33:12.9510671Z compiled=True, 2025-05-07T20:33:12.9510939Z ) 2025-05-07T20:33:12.9511349Z self = 2025-05-07T20:33:12.9511965Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9512292Z 2025-05-07T20:33:12.9512407Z @given( 2025-05-07T20:33:12.9512696Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9513098Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9513556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9513978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9514411Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9514792Z ) 2025-05-07T20:33:12.9515242Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9515823Z def test_silu_mul_quant( 2025-05-07T20:33:12.9516139Z self, 2025-05-07T20:33:12.9516387Z T: int, 2025-05-07T20:33:12.9516635Z D: int, 2025-05-07T20:33:12.9516918Z scale_ub: Optional[float], 2025-05-07T20:33:12.9517276Z contiguous: bool, 2025-05-07T20:33:12.9517576Z compiled: bool, 2025-05-07T20:33:12.9517866Z ) -> None: 2025-05-07T20:33:12.9518142Z torch.manual_seed(2025) 2025-05-07T20:33:12.9518450Z 2025-05-07T20:33:12.9518799Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9519247Z 2025-05-07T20:33:12.9519512Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9519896Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9520327Z x = x_sign * x_clamp 2025-05-07T20:33:12.9520667Z x0 = x[:, :D] 2025-05-07T20:33:12.9520964Z x1 = x[:, D:] 2025-05-07T20:33:12.9521260Z 2025-05-07T20:33:12.9521514Z if contiguous: 2025-05-07T20:33:12.9521816Z x0 = x0.contiguous() 2025-05-07T20:33:12.9522143Z x1 = x1.contiguous() 2025-05-07T20:33:12.9522449Z 2025-05-07T20:33:12.9522703Z if scale_ub is not None: 2025-05-07T20:33:12.9523073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9523521Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9523925Z ) 2025-05-07T20:33:12.9524183Z else: 2025-05-07T20:33:12.9524465Z scale_ub_tensor = None 2025-05-07T20:33:12.9524809Z 2025-05-07T20:33:12.9525123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9525605Z op = silu_mul_quant 2025-05-07T20:33:12.9525926Z if compiled: 2025-05-07T20:33:12.9526248Z op = torch.compile(op) 2025-05-07T20:33:12.9526629Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9526982Z 2025-05-07T20:33:12.9527230Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9527604Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9527979Z 2025-05-07T20:33:12.9528293Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9528727Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9529108Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9529511Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9529980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9530384Z 2025-05-07T20:33:12.9530644Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9530905Z 2025-05-07T20:33:12.9531096Z moe/activation_test.py:126: 2025-05-07T20:33:12.9531480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9531906Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9532332Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9533457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9534521Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9535220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9536117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9537061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9538154Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9539160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9540015Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9540822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9541515Z fn() 2025-05-07T20:33:12.9542224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9543027Z self.fn.run( 2025-05-07T20:33:12.9543689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9544367Z kernel = self.compile( 2025-05-07T20:33:12.9545056Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9545905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9546407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9546707Z 2025-05-07T20:33:12.9546966Z self = 2025-05-07T20:33:12.9548343Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9550212Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057db01c60>} 2025-05-07T20:33:12.9552008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9553489Z context = 2025-05-07T20:33:12.9553869Z 2025-05-07T20:33:12.9554079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9554748Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9555351Z module_map=module_map) 2025-05-07T20:33:12.9555816Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9556272Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9556615Z E ^ 2025-05-07T20:33:12.9557223Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9557838Z 2025-05-07T20:33:12.9558392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9559076Z 2025-05-07T20:33:12.9559635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9560207Z self=, 2025-05-07T20:33:12.9560783Z T=2048, 2025-05-07T20:33:12.9561028Z D=5120, 2025-05-07T20:33:12.9561282Z scale_ub=1200.0, 2025-05-07T20:33:12.9561592Z contiguous=True, 2025-05-07T20:33:12.9561894Z compiled=False, 2025-05-07T20:33:12.9562168Z ) 2025-05-07T20:33:12.9562705Z self = 2025-05-07T20:33:12.9563388Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:12.9563757Z 2025-05-07T20:33:12.9563866Z @given( 2025-05-07T20:33:12.9564175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9564605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9565030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9565479Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9566024Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9566416Z ) 2025-05-07T20:33:12.9566881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9567477Z def test_silu_mul_quant( 2025-05-07T20:33:12.9567808Z self, 2025-05-07T20:33:12.9568072Z T: int, 2025-05-07T20:33:12.9568347Z D: int, 2025-05-07T20:33:12.9568654Z scale_ub: Optional[float], 2025-05-07T20:33:12.9569000Z contiguous: bool, 2025-05-07T20:33:12.9569305Z compiled: bool, 2025-05-07T20:33:12.9569591Z ) -> None: 2025-05-07T20:33:12.9569864Z torch.manual_seed(2025) 2025-05-07T20:33:12.9570234Z 2025-05-07T20:33:12.9570588Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9571025Z 2025-05-07T20:33:12.9571278Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9571659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9572072Z x = x_sign * x_clamp 2025-05-07T20:33:12.9572388Z x0 = x[:, :D] 
2025-05-07T20:33:12.9572677Z x1 = x[:, D:] 2025-05-07T20:33:12.9573092Z 2025-05-07T20:33:12.9573350Z if contiguous: 2025-05-07T20:33:12.9573658Z x0 = x0.contiguous() 2025-05-07T20:33:12.9573999Z x1 = x1.contiguous() 2025-05-07T20:33:12.9574313Z 2025-05-07T20:33:12.9574567Z if scale_ub is not None: 2025-05-07T20:33:12.9574944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9575389Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9575809Z ) 2025-05-07T20:33:12.9576075Z else: 2025-05-07T20:33:12.9576359Z scale_ub_tensor = None 2025-05-07T20:33:12.9576703Z 2025-05-07T20:33:12.9577016Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9577436Z op = silu_mul_quant 2025-05-07T20:33:12.9577775Z if compiled: 2025-05-07T20:33:12.9578119Z op = torch.compile(op) 2025-05-07T20:33:12.9578591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9578952Z 2025-05-07T20:33:12.9579207Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9579425Z 2025-05-07T20:33:12.9579564Z moe/activation_test.py:117: 2025-05-07T20:33:12.9579947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9580384Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9580768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9581712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9582673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9583414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9584335Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9585297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9586024Z kernel = self.compile( 2025-05-07T20:33:12.9586763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9587626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9588225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9588529Z 2025-05-07T20:33:12.9588802Z self = 2025-05-07T20:33:12.9590314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9592226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d958220>} 2025-05-07T20:33:12.9594062Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9595434Z context = 2025-05-07T20:33:12.9595817Z 2025-05-07T20:33:12.9596047Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9596764Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9597374Z module_map=module_map) 2025-05-07T20:33:12.9597864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9598347Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9598701Z E ^ 2025-05-07T20:33:12.9599367Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9600008Z 2025-05-07T20:33:12.9600633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9601329Z 2025-05-07T20:33:12.9601478Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9602016Z self=, 2025-05-07T20:33:12.9602562Z T=2048, 2025-05-07T20:33:12.9602815Z D=5120, 2025-05-07T20:33:12.9603075Z scale_ub=1200.0, 2025-05-07T20:33:12.9603366Z contiguous=True, 2025-05-07T20:33:12.9603670Z compiled=True, 2025-05-07T20:33:12.9603945Z ) 2025-05-07T20:33:12.9604367Z self = 2025-05-07T20:33:12.9605003Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:12.9605349Z 2025-05-07T20:33:12.9605475Z @given( 2025-05-07T20:33:12.9605852Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9606290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9606704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9607138Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9607578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9607967Z ) 2025-05-07T20:33:12.9608436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9609035Z def test_silu_mul_quant( 2025-05-07T20:33:12.9609367Z self, 2025-05-07T20:33:12.9609620Z T: int, 2025-05-07T20:33:12.9609888Z D: int, 2025-05-07T20:33:12.9610021Z scale_ub: Optional[float], 2025-05-07T20:33:12.9610153Z contiguous: bool, 2025-05-07T20:33:12.9610269Z compiled: bool, 2025-05-07T20:33:12.9610383Z ) -> None: 2025-05-07T20:33:12.9610577Z torch.manual_seed(2025) 2025-05-07T20:33:12.9610687Z 2025-05-07T20:33:12.9610930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9611034Z 2025-05-07T20:33:12.9611164Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9611348Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9611473Z x = x_sign * x_clamp 2025-05-07T20:33:12.9611585Z x0 = x[:, :D] 2025-05-07T20:33:12.9611758Z x1 = x[:, D:] 2025-05-07T20:33:12.9611861Z 2025-05-07T20:33:12.9611981Z if contiguous: 2025-05-07T20:33:12.9612115Z x0 = x0.contiguous() 2025-05-07T20:33:12.9612240Z x1 = x1.contiguous() 2025-05-07T20:33:12.9612348Z 2025-05-07T20:33:12.9612476Z if scale_ub is not None: 2025-05-07T20:33:12.9612626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9612825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9612933Z ) 2025-05-07T20:33:12.9613180Z else: 2025-05-07T20:33:12.9613372Z scale_ub_tensor = None 2025-05-07T20:33:12.9613448Z 2025-05-07T20:33:12.9613582Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9613678Z op = silu_mul_quant 2025-05-07T20:33:12.9613762Z if compiled: 2025-05-07T20:33:12.9613862Z op = torch.compile(op) 2025-05-07T20:33:12.9613972Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9614049Z 2025-05-07T20:33:12.9614151Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9614276Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9614347Z 2025-05-07T20:33:12.9614492Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9614594Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9614694Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9614821Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9614964Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9615042Z 2025-05-07T20:33:12.9615148Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9615153Z 2025-05-07T20:33:12.9615251Z moe/activation_test.py:126: 2025-05-07T20:33:12.9615389Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9615496Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9615633Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9616192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9616293Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9616647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9616874Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9617287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9617545Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9617917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9618086Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9618431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9618509Z fn() 2025-05-07T20:33:12.9618907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9618991Z self.fn.run( 2025-05-07T20:33:12.9619325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9619470Z kernel = self.compile( 2025-05-07T20:33:12.9619850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9620033Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9620187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9620193Z 2025-05-07T20:33:12.9620485Z self = 2025-05-07T20:33:12.9621261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9621755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057d9596c0>} 2025-05-07T20:33:12.9622591Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9622796Z context = 2025-05-07T20:33:12.9622801Z 2025-05-07T20:33:12.9622967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9623237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9623348Z module_map=module_map) 2025-05-07T20:33:12.9623513Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9623623Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9623702Z E ^ 2025-05-07T20:33:12.9624058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9624063Z 2025-05-07T20:33:12.9624477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9624481Z 2025-05-07T20:33:12.9624584Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9624812Z self=, 2025-05-07T20:33:12.9624891Z T=16384, 2025-05-07T20:33:12.9624968Z D=7168, 2025-05-07T20:33:12.9625065Z scale_ub=1200.0, 2025-05-07T20:33:12.9625152Z contiguous=False, 2025-05-07T20:33:12.9625246Z compiled=False, 2025-05-07T20:33:12.9625323Z ) 2025-05-07T20:33:12.9625538Z self = 2025-05-07T20:33:12.9625726Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:12.9625731Z 2025-05-07T20:33:12.9625810Z @given( 2025-05-07T20:33:12.9625930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9626040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9626205Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9626322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9626446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9626523Z ) 2025-05-07T20:33:12.9626777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9626875Z def test_silu_mul_quant( 2025-05-07T20:33:12.9626955Z self, 2025-05-07T20:33:12.9627041Z T: int, 2025-05-07T20:33:12.9627120Z D: int, 2025-05-07T20:33:12.9627221Z scale_ub: Optional[float], 2025-05-07T20:33:12.9627319Z contiguous: bool, 2025-05-07T20:33:12.9627407Z compiled: bool, 2025-05-07T20:33:12.9627488Z ) -> None: 2025-05-07T20:33:12.9627588Z torch.manual_seed(2025) 2025-05-07T20:33:12.9627665Z 2025-05-07T20:33:12.9627833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9627965Z 2025-05-07T20:33:12.9628061Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9628195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9628286Z x = x_sign * x_clamp 2025-05-07T20:33:12.9628368Z x0 = x[:, :D] 2025-05-07T20:33:12.9628457Z x1 = x[:, D:] 2025-05-07T20:33:12.9628532Z 2025-05-07T20:33:12.9628616Z if contiguous: 2025-05-07T20:33:12.9628756Z x0 = x0.contiguous() 2025-05-07T20:33:12.9628845Z x1 = x1.contiguous() 2025-05-07T20:33:12.9628916Z 2025-05-07T20:33:12.9629013Z if scale_ub is not None: 2025-05-07T20:33:12.9629118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9629252Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9629339Z ) 2025-05-07T20:33:12.9629417Z else: 2025-05-07T20:33:12.9629522Z scale_ub_tensor = None 2025-05-07T20:33:12.9629596Z 2025-05-07T20:33:12.9629768Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9629871Z op = silu_mul_quant 2025-05-07T20:33:12.9629958Z if compiled: 2025-05-07T20:33:12.9630058Z op = torch.compile(op) 2025-05-07T20:33:12.9630171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9630248Z 2025-05-07T20:33:12.9630341Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9630345Z 2025-05-07T20:33:12.9630453Z moe/activation_test.py:117: 2025-05-07T20:33:12.9630584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9630690Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9630790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9631284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:12.9631394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9631755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9631978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9632319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9632419Z kernel = self.compile( 2025-05-07T20:33:12.9632804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9632983Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9633110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9633115Z 2025-05-07T20:33:12.9633327Z self = 2025-05-07T20:33:12.9634098Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9634645Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f057c824720>} 2025-05-07T20:33:12.9635377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9635568Z context = 2025-05-07T20:33:12.9635580Z 2025-05-07T20:33:12.9635744Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9636000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9636118Z module_map=module_map) 2025-05-07T20:33:12.9636324Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9636429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9636517Z E ^ 2025-05-07T20:33:12.9636868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
(test source identical to the listing above; with compiled=True, fn() succeeds and the failure moves to the reference path)
moe/activation_test.py:126: in test_silu_mul_quant: y_fp8_ref, y_scale_ref = ref_fn()
moe/activation_test.py:124: in ref_fn: return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row: _kernel_quantize_fp8_row[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
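Every example above fails at the same point for the same reason: both Triton kernels (_fbgemm_silu_mul_quant and _kernel_quantize_fp8_row) cast to the fp8e4nv (E4M3) element type, and the GPU serving this job only exposes fp8e4b15 and fp8e5, as the ValueError states. Triton generally enables fp8e4nv only on compute capability 8.9 (Ada) or newer, while the A10G on a g5 runner is SM 8.6. A capability guard would skip these cases cleanly instead of erroring inside the compiler; the sketch below is illustrative only, and the helper and class names are hypothetical, not the actual test file:

import unittest

import torch

def _supports_fp8_e4m3() -> bool:
    # fp8e4nv (E4M3) needs SM 8.9 (Ada) or SM 9.0 (Hopper) hardware;
    # the A10G running this job is SM 8.6, so this returns False here.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

@unittest.skipIf(
    not torch.cuda.is_available() or not _supports_fp8_e4m3(),
    "fp8e4nv requires a GPU with compute capability >= 8.9",
)
class ActivationTests(unittest.TestCase):  # hypothetical class name
    ...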
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:117: in test_silu_mul_quant: y_fp8, y_scale = fn()
moe/activation_test.py:115: in fn: return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant: _fbgemm_silu_mul_quant[grid](
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
(same failure: _fbgemm_silu_mul_quant fails to compile with the fp8e4nv ValueError above)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
(same failure: _fbgemm_silu_mul_quant fails to compile with the fp8e4nv ValueError above)
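One pattern in the examples above: with compiled=False, fn() fails immediately because silu_mul_quant launches _fbgemm_silu_mul_quant eagerly; with compiled=True, fn() apparently survives (torch.compile handles the op differently at this point) and the identical error then surfaces in the reference path via _kernel_quantize_fp8_row. Either way, no combination in the test matrix can pass on this machine. The failure also reproduces without FBGEMM at all; a minimal sketch, assuming the same Triton build and a pre-SM-8.9 GPU (the kernel and its names are illustrative):

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_fp8e4nv(x_ptr, y_ptr):
    # A cast to tl.float8e4nv is enough: on a pre-SM-8.9 GPU, building the
    # kernel IR rejects it with ValueError("type fp8e4nv not supported in
    # this architecture. ...").
    x = tl.load(x_ptr)
    tl.store(y_ptr, x.to(tl.float8e4nv))

x = torch.randn(1, device="cuda")
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv[(1,)](x, y)  # raises triton.compiler.errors.CompilationError here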
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
(same failure: _fbgemm_silu_mul_quant fails to compile with the fp8e4nv ValueError above)

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
(fn() succeeds under torch.compile; ref_fn() then fails: _kernel_quantize_fp8_row raises the same fp8e4nv ValueError)
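The ValueError also names the only fp8 types this architecture does support: fp8e4b15 and fp8e5. If these kernels were meant to run, rather than be skipped, on pre-Ada GPUs, the quantization dtype would have to be chosen per device, trading E4M3's extra mantissa bit for E5M2's hardware availability. A hypothetical selection helper as a sketch; note that neither silu_mul_quant nor triton_quantize_fp8_row is shown anywhere in this log to accept a dtype argument:

import torch

def pick_fp8_dtype() -> torch.dtype:
    # E4M3 (Triton's fp8e4nv) only exists in hardware from SM 8.9 onward;
    # E5M2 (Triton's fp8e5) is the wider-range, lower-precision fallback.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2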
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9779747Z 2025-05-07T20:33:12.9780284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9780292Z 2025-05-07T20:33:12.9780399Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9780655Z self=, 2025-05-07T20:33:12.9780732Z T=4096, 2025-05-07T20:33:12.9780808Z D=5120, 2025-05-07T20:33:12.9780897Z scale_ub=None, 2025-05-07T20:33:12.9780982Z contiguous=True, 2025-05-07T20:33:12.9781070Z compiled=True, 2025-05-07T20:33:12.9781147Z ) 2025-05-07T20:33:12.9781395Z self = 2025-05-07T20:33:12.9781579Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9781583Z 2025-05-07T20:33:12.9781663Z @given( 2025-05-07T20:33:12.9781789Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9781892Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9782022Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9782151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9782281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9794986Z ) 2025-05-07T20:33:12.9795260Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9795354Z def test_silu_mul_quant( 2025-05-07T20:33:12.9795435Z self, 2025-05-07T20:33:12.9795518Z T: int, 2025-05-07T20:33:12.9795595Z D: int, 2025-05-07T20:33:12.9795696Z scale_ub: Optional[float], 2025-05-07T20:33:12.9795784Z contiguous: bool, 2025-05-07T20:33:12.9795866Z compiled: bool, 2025-05-07T20:33:12.9795952Z ) -> None: 2025-05-07T20:33:12.9796045Z torch.manual_seed(2025) 2025-05-07T20:33:12.9796117Z 2025-05-07T20:33:12.9796292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9796363Z 2025-05-07T20:33:12.9796461Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9796739Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9796827Z x = x_sign * x_clamp 2025-05-07T20:33:12.9796911Z x0 = x[:, :D] 2025-05-07T20:33:12.9796987Z x1 = x[:, D:] 2025-05-07T20:33:12.9797060Z 2025-05-07T20:33:12.9797148Z if contiguous: 2025-05-07T20:33:12.9797237Z x0 = x0.contiguous() 2025-05-07T20:33:12.9797327Z x1 = x1.contiguous() 2025-05-07T20:33:12.9797403Z 2025-05-07T20:33:12.9797493Z if scale_ub is not None: 2025-05-07T20:33:12.9797596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9797732Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9797802Z ) 2025-05-07T20:33:12.9797881Z else: 2025-05-07T20:33:12.9797978Z scale_ub_tensor = None 2025-05-07T20:33:12.9798046Z 2025-05-07T20:33:12.9798180Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9798313Z op = silu_mul_quant 2025-05-07T20:33:12.9798401Z if compiled: 2025-05-07T20:33:12.9798514Z op = torch.compile(op) 2025-05-07T20:33:12.9798619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9798691Z 2025-05-07T20:33:12.9798789Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9798911Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9798980Z 2025-05-07T20:33:12.9799165Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9799268Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9799375Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9799497Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9799631Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9799707Z 2025-05-07T20:33:12.9799804Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:12.9799809Z 2025-05-07T20:33:12.9799917Z moe/activation_test.py:126: 2025-05-07T20:33:12.9800130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9800244Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9800376Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9800943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9801045Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9801399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9801622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9801984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9802241Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9802611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9802781Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9803112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9803188Z fn() 2025-05-07T20:33:12.9803591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9803672Z self.fn.run( 2025-05-07T20:33:12.9804004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9804101Z kernel = self.compile( 2025-05-07T20:33:12.9804475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9804653Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9804827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9804832Z 2025-05-07T20:33:12.9805035Z self = 2025-05-07T20:33:12.9805803Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9806296Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055335b4c0>} 2025-05-07T20:33:12.9807031Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9807261Z context = 2025-05-07T20:33:12.9807268Z 2025-05-07T20:33:12.9807441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9807697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9807802Z module_map=module_map) 2025-05-07T20:33:12.9807971Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9808119Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9808196Z E ^ 2025-05-07T20:33:12.9808553Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9808558Z 2025-05-07T20:33:12.9808961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9808966Z 2025-05-07T20:33:12.9809070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9809335Z self=, 2025-05-07T20:33:12.9809415Z T=16384, 2025-05-07T20:33:12.9809499Z D=5120, 2025-05-07T20:33:12.9809583Z scale_ub=None, 2025-05-07T20:33:12.9809668Z contiguous=True, 2025-05-07T20:33:12.9809753Z compiled=True, 2025-05-07T20:33:12.9809828Z ) 2025-05-07T20:33:12.9810047Z self = 2025-05-07T20:33:12.9810232Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:12.9810238Z 2025-05-07T20:33:12.9810333Z @given( 2025-05-07T20:33:12.9810477Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9810589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9810703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9810822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9810939Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9811021Z ) 2025-05-07T20:33:12.9811269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9811363Z def test_silu_mul_quant( 2025-05-07T20:33:12.9811435Z self, 2025-05-07T20:33:12.9811519Z T: int, 2025-05-07T20:33:12.9811594Z D: int, 2025-05-07T20:33:12.9811700Z scale_ub: Optional[float], 2025-05-07T20:33:12.9811795Z contiguous: bool, 2025-05-07T20:33:12.9811878Z compiled: bool, 2025-05-07T20:33:12.9811964Z ) -> None: 2025-05-07T20:33:12.9812057Z torch.manual_seed(2025) 2025-05-07T20:33:12.9812133Z 2025-05-07T20:33:12.9812312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9812385Z 2025-05-07T20:33:12.9812475Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9812607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9812698Z x = x_sign * x_clamp 2025-05-07T20:33:12.9812780Z x0 = x[:, :D] 2025-05-07T20:33:12.9812912Z x1 = x[:, D:] 2025-05-07T20:33:12.9813058Z 2025-05-07T20:33:12.9813143Z if contiguous: 2025-05-07T20:33:12.9813243Z x0 = x0.contiguous() 2025-05-07T20:33:12.9813332Z x1 = x1.contiguous() 2025-05-07T20:33:12.9813412Z 2025-05-07T20:33:12.9813503Z if scale_ub is not None: 2025-05-07T20:33:12.9813609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9813746Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9813817Z ) 2025-05-07T20:33:12.9813889Z else: 2025-05-07T20:33:12.9813989Z scale_ub_tensor = None 2025-05-07T20:33:12.9814060Z 2025-05-07T20:33:12.9814186Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9814282Z op = silu_mul_quant 2025-05-07T20:33:12.9814367Z if compiled: 2025-05-07T20:33:12.9814466Z op = torch.compile(op) 2025-05-07T20:33:12.9814625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9814701Z 2025-05-07T20:33:12.9814798Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9814919Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9814988Z 2025-05-07T20:33:12.9815126Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9815227Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9815364Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9815491Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9815628Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9815702Z 2025-05-07T20:33:12.9815805Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:12.9815810Z 2025-05-07T20:33:12.9815907Z moe/activation_test.py:126: 2025-05-07T20:33:12.9816041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9816147Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9816322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9816880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9816982Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9817335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9817564Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9817927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9818189Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9818562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9818740Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9819077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9819155Z fn() 2025-05-07T20:33:12.9819556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9819640Z self.fn.run( 2025-05-07T20:33:12.9819971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9820069Z kernel = self.compile( 2025-05-07T20:33:12.9820466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9820667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9820799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9820806Z 2025-05-07T20:33:12.9821054Z self = 2025-05-07T20:33:12.9821822Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9822315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05528c9580>} 2025-05-07T20:33:12.9823052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9823237Z context = 2025-05-07T20:33:12.9823241Z 2025-05-07T20:33:12.9823444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9823707Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9823812Z module_map=module_map) 2025-05-07T20:33:12.9823970Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9824081Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9824158Z E ^ 2025-05-07T20:33:12.9824556Z E ValueError("type fp8e4nv not supported in this architecture. 
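Note on the failure mode: every example above dies in the same place, Triton's make_ir, because fp8e4nv (Triton's name for PyTorch's float8_e4m3fn) is only compilable on NVIDIA GPUs with compute capability 8.9 or newer; pre-Ada parts such as SM 8.0/8.6 GPUs expose only fp8e5 and fp8e4b15, which is exactly what the ValueError lists. A minimal sketch of a capability gate follows; the helper name supports_fp8e4nv, the guarded test name, and the (8, 9) threshold are illustrative assumptions, not FBGEMM's actual guard.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv == float8_e4m3fn. Hardware-backed conversions start at
    # compute capability (8, 9) (Ada) and (9, 0) (Hopper); SM 8.6 and
    # older only get fp8e5 / fp8e4b15, matching the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 unsupported on this GPU")
def test_silu_mul_quant_guarded() -> None:
    ...  # body as in the listing above

With a guard of this shape the runner would report one skip instead of tripping the same CompilationError once per Hypothesis example.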
2025-05-07T20:33:12.9825068Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:12.9825291Z     self=,
2025-05-07T20:33:12.9825366Z     T=1,
2025-05-07T20:33:12.9825453Z     D=5120,
2025-05-07T20:33:12.9825600Z     scale_ub=1200.0,
2025-05-07T20:33:12.9825690Z     contiguous=True,
2025-05-07T20:33:12.9825779Z     compiled=True,
2025-05-07T20:33:12.9825853Z )
2025-05-07T20:33:12.9826069Z self = 
2025-05-07T20:33:12.9826239Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:12.9826247Z 
(Source listing identical to the example above; unlike it, this example fails at the fused op itself rather than in the reference path:)
2025-05-07T20:33:12.9830600Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:12.9830650Z 
2025-05-07T20:33:12.9830749Z moe/activation_test.py:117: 
2025-05-07T20:33:12.9830875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9830981Z moe/activation_test.py:115: in fn
2025-05-07T20:33:12.9831079Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:12.9831440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:12.9831578Z     return fn(*args, **kwargs)
2025-05-07T20:33:12.9832063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:12.9832162Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:12.9832520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9832739Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9833118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9833214Z     kernel = self.compile(
2025-05-07T20:33:12.9833591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9833768Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9833896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9833901Z 
2025-05-07T20:33:12.9834109Z self = 
2025-05-07T20:33:12.9834869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9835364Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553854860>}
2025-05-07T20:33:12.9836100Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9836287Z context = 
2025-05-07T20:33:12.9836294Z 
2025-05-07T20:33:12.9836460Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9836714Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9836820Z                            module_map=module_map)
2025-05-07T20:33:12.9836982Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9837077Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:12.9837153Z E       ^
2025-05-07T20:33:12.9837508Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9837555Z 
2025-05-07T20:33:12.9837959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9837964Z 
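For orientation while reading these tracebacks: both failing call sites compute y = silu(x0) * x1 and then quantize y rowwise to FP8, one scale per row. ref_fn does it in two steps (fp32 math, then triton_quantize_fp8_row), while silu_mul_quant fuses everything into one Triton kernel. The sketch below restates that contract in plain PyTorch under stated assumptions: scale = row_max / FP8_MAX, with scale_ub clamping the row max. That convention is consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]) but is not guaranteed to match FBGEMM's exact numerics; the name silu_mul_quant_ref is hypothetical.

from typing import Optional, Tuple

import torch

FP8_DTYPE = torch.float8_e4m3fn        # what Triton calls fp8e4nv
FP8_MAX = torch.finfo(FP8_DTYPE).max   # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in ref_fn above.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Rowwise scale: one scalar per row, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(y.dtype))
    y_scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
    return y_fp8, y_scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None] then recovers y up to FP8 rounding error, on hardware where the cast compiles at all.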
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9837555Z 2025-05-07T20:33:12.9837959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9837964Z 2025-05-07T20:33:12.9838069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9838284Z self=, 2025-05-07T20:33:12.9838364Z T=1, 2025-05-07T20:33:12.9838445Z D=5120, 2025-05-07T20:33:12.9838526Z scale_ub=None, 2025-05-07T20:33:12.9838617Z contiguous=False, 2025-05-07T20:33:12.9838701Z compiled=True, 2025-05-07T20:33:12.9838775Z ) 2025-05-07T20:33:12.9838991Z self = 2025-05-07T20:33:12.9839152Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:12.9839156Z 2025-05-07T20:33:12.9839233Z @given( 2025-05-07T20:33:12.9839404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9839504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9839617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9839738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9839851Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9839924Z ) 2025-05-07T20:33:12.9840247Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9840348Z def test_silu_mul_quant( 2025-05-07T20:33:12.9840434Z self, 2025-05-07T20:33:12.9840510Z T: int, 2025-05-07T20:33:12.9840585Z D: int, 2025-05-07T20:33:12.9840686Z scale_ub: Optional[float], 2025-05-07T20:33:12.9840773Z contiguous: bool, 2025-05-07T20:33:12.9840854Z compiled: bool, 2025-05-07T20:33:12.9840936Z ) -> None: 2025-05-07T20:33:12.9841028Z torch.manual_seed(2025) 2025-05-07T20:33:12.9841103Z 2025-05-07T20:33:12.9841314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9841389Z 2025-05-07T20:33:12.9841481Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9841612Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9841701Z x = x_sign * x_clamp 2025-05-07T20:33:12.9841787Z x0 = x[:, :D] 2025-05-07T20:33:12.9841863Z x1 = x[:, D:] 2025-05-07T20:33:12.9841939Z 2025-05-07T20:33:12.9842029Z if contiguous: 2025-05-07T20:33:12.9842119Z x0 = x0.contiguous() 2025-05-07T20:33:12.9842209Z x1 = x1.contiguous() 2025-05-07T20:33:12.9842285Z 2025-05-07T20:33:12.9842376Z if scale_ub is not None: 2025-05-07T20:33:12.9842480Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9842617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9842692Z ) 2025-05-07T20:33:12.9842768Z else: 2025-05-07T20:33:12.9842875Z scale_ub_tensor = None 2025-05-07T20:33:12.9842946Z 2025-05-07T20:33:12.9843077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9843165Z op = silu_mul_quant 2025-05-07T20:33:12.9843247Z if compiled: 2025-05-07T20:33:12.9843352Z op = torch.compile(op) 2025-05-07T20:33:12.9843456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9843530Z 2025-05-07T20:33:12.9843617Z y_fp8, y_scale = fn() 2025-05-07T20:33:12.9843744Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:12.9843811Z 2025-05-07T20:33:12.9843944Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9844048Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:12.9844145Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:12.9844267Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:12.9844404Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9844524Z 2025-05-07T20:33:12.9844630Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:12.9844635Z 2025-05-07T20:33:12.9844734Z moe/activation_test.py:126: 2025-05-07T20:33:12.9844859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9844969Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:12.9845099Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:12.9845649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:12.9845749Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:12.9846102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9846324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9846725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:12.9846980Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:12.9847352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:12.9847517Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:12.9847894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:12.9847969Z fn() 2025-05-07T20:33:12.9848360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:12.9848445Z self.fn.run( 2025-05-07T20:33:12.9848776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9848876Z kernel = self.compile( 2025-05-07T20:33:12.9849296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9849472Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9849601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9849606Z 2025-05-07T20:33:12.9849807Z self = 2025-05-07T20:33:12.9850599Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9851122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553856b60>} 2025-05-07T20:33:12.9851853Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9852049Z context = 2025-05-07T20:33:12.9852053Z 2025-05-07T20:33:12.9852216Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9852479Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9852586Z module_map=module_map) 2025-05-07T20:33:12.9852748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9852852Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:12.9852933Z E ^ 2025-05-07T20:33:12.9853335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9853340Z 2025-05-07T20:33:12.9853751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9853799Z 2025-05-07T20:33:12.9853899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9854121Z self=, 2025-05-07T20:33:12.9854196Z T=1, 2025-05-07T20:33:12.9854270Z D=5120, 2025-05-07T20:33:12.9854354Z scale_ub=None, 2025-05-07T20:33:12.9854437Z contiguous=True, 2025-05-07T20:33:12.9854517Z compiled=False, 2025-05-07T20:33:12.9854590Z ) 2025-05-07T20:33:12.9854805Z self = 2025-05-07T20:33:12.9854964Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:12.9854978Z 2025-05-07T20:33:12.9855050Z @given( 2025-05-07T20:33:12.9855168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9855270Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9855429Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9855546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9855670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9855747Z ) 2025-05-07T20:33:12.9855989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9856091Z def test_silu_mul_quant( 2025-05-07T20:33:12.9856231Z self, 2025-05-07T20:33:12.9856306Z T: int, 2025-05-07T20:33:12.9856384Z D: int, 2025-05-07T20:33:12.9856479Z scale_ub: Optional[float], 2025-05-07T20:33:12.9856568Z contiguous: bool, 2025-05-07T20:33:12.9856654Z compiled: bool, 2025-05-07T20:33:12.9856732Z ) -> None: 2025-05-07T20:33:12.9856829Z torch.manual_seed(2025) 2025-05-07T20:33:12.9856905Z 2025-05-07T20:33:12.9857072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9857153Z 2025-05-07T20:33:12.9857285Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9857410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9857502Z x = x_sign * x_clamp 2025-05-07T20:33:12.9857581Z x0 = x[:, :D] 2025-05-07T20:33:12.9857659Z x1 = x[:, D:] 2025-05-07T20:33:12.9857731Z 2025-05-07T20:33:12.9857814Z if contiguous: 2025-05-07T20:33:12.9857906Z x0 = x0.contiguous() 2025-05-07T20:33:12.9858002Z x1 = x1.contiguous() 2025-05-07T20:33:12.9858074Z 2025-05-07T20:33:12.9858165Z if scale_ub is not None: 2025-05-07T20:33:12.9858274Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9858406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9858485Z ) 2025-05-07T20:33:12.9858560Z else: 2025-05-07T20:33:12.9858655Z scale_ub_tensor = None 2025-05-07T20:33:12.9858733Z 2025-05-07T20:33:12.9858863Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9858959Z op = silu_mul_quant 2025-05-07T20:33:12.9859047Z if compiled: 2025-05-07T20:33:12.9859145Z op = torch.compile(op) 2025-05-07T20:33:12.9859469Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9859581Z 2025-05-07T20:33:12.9859697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9859703Z 2025-05-07T20:33:12.9859803Z moe/activation_test.py:117: 2025-05-07T20:33:12.9859935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9860038Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9860140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9860631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9860726Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9861088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9861435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9861838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9861934Z kernel = self.compile( 2025-05-07T20:33:12.9862385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9862582Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9862717Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9862722Z 2025-05-07T20:33:12.9862952Z self = 2025-05-07T20:33:12.9863976Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9864591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f05538579c0>} 2025-05-07T20:33:12.9865505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9865777Z context = 2025-05-07T20:33:12.9865782Z 2025-05-07T20:33:12.9865968Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9866268Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9866379Z module_map=module_map) 2025-05-07T20:33:12.9866563Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9866722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9866804Z E ^ 2025-05-07T20:33:12.9867159Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9867163Z 2025-05-07T20:33:12.9867566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9867573Z 2025-05-07T20:33:12.9867676Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9867894Z self=, 2025-05-07T20:33:12.9867969Z T=128, 2025-05-07T20:33:12.9868050Z D=5120, 2025-05-07T20:33:12.9868130Z scale_ub=None, 2025-05-07T20:33:12.9868215Z contiguous=False, 2025-05-07T20:33:12.9868297Z compiled=True, 2025-05-07T20:33:12.9868367Z ) 2025-05-07T20:33:12.9868585Z self = 2025-05-07T20:33:12.9868759Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:12.9868764Z 2025-05-07T20:33:12.9868834Z @given( 2025-05-07T20:33:12.9868955Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9869055Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9869169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9869288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9869398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9869472Z ) 2025-05-07T20:33:12.9869717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9869809Z def test_silu_mul_quant( 2025-05-07T20:33:12.9869888Z self, 2025-05-07T20:33:12.9869979Z T: int, 2025-05-07T20:33:12.9870056Z D: int, 2025-05-07T20:33:12.9870177Z scale_ub: Optional[float], 2025-05-07T20:33:12.9870267Z contiguous: bool, 2025-05-07T20:33:12.9870400Z compiled: bool, 2025-05-07T20:33:12.9870479Z ) -> None: 2025-05-07T20:33:12.9870574Z torch.manual_seed(2025) 2025-05-07T20:33:12.9870646Z 2025-05-07T20:33:12.9870814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9870887Z 2025-05-07T20:33:12.9870975Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9871103Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9871192Z x = x_sign * x_clamp 2025-05-07T20:33:12.9871274Z x0 = x[:, :D] 2025-05-07T20:33:12.9871354Z x1 = x[:, D:] 2025-05-07T20:33:12.9871424Z 2025-05-07T20:33:12.9871512Z if contiguous: 2025-05-07T20:33:12.9871601Z x0 = x0.contiguous() 2025-05-07T20:33:12.9871690Z x1 = x1.contiguous() 2025-05-07T20:33:12.9871768Z 2025-05-07T20:33:12.9871857Z if scale_ub is not None: 2025-05-07T20:33:12.9871958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9872145Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9872224Z ) 2025-05-07T20:33:12.9872300Z else: 2025-05-07T20:33:12.9872396Z scale_ub_tensor = None 2025-05-07T20:33:12.9872466Z 2025-05-07T20:33:12.9872591Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9872683Z op = silu_mul_quant 2025-05-07T20:33:12.9872805Z if compiled: 2025-05-07T20:33:12.9872909Z op = torch.compile(op) 2025-05-07T20:33:12.9873010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9873080Z 2025-05-07T20:33:12.9873174Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9873178Z 2025-05-07T20:33:12.9873274Z moe/activation_test.py:117: 2025-05-07T20:33:12.9873398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9873503Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9873600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9874005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:12.9874097Z return fn(*args, **kwargs) 
2025-05-07T20:33:12.9874581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9874679Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9875029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9875248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9875585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9875676Z kernel = self.compile( 2025-05-07T20:33:12.9876056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9876231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9876354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9876359Z 2025-05-07T20:33:12.9876565Z self = 2025-05-07T20:33:12.9877323Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9877823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553854ea0>} 2025-05-07T20:33:12.9878561Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9878824Z context = 2025-05-07T20:33:12.9878833Z 2025-05-07T20:33:12.9878997Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9879253Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9879362Z module_map=module_map) 2025-05-07T20:33:12.9879525Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9879621Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9879701Z E ^ 2025-05-07T20:33:12.9880048Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9880053Z 2025-05-07T20:33:12.9880486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9880537Z 2025-05-07T20:33:12.9880660Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9880878Z self=, 2025-05-07T20:33:12.9880962Z T=128, 2025-05-07T20:33:12.9881039Z D=7168, 2025-05-07T20:33:12.9881118Z scale_ub=1200.0, 2025-05-07T20:33:12.9881204Z contiguous=False, 2025-05-07T20:33:12.9881284Z compiled=False, 2025-05-07T20:33:12.9881393Z ) 2025-05-07T20:33:12.9881613Z self = 2025-05-07T20:33:12.9881780Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:12.9881785Z 2025-05-07T20:33:12.9881863Z @given( 2025-05-07T20:33:12.9881979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9882078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9882195Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9882313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9882468Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9882546Z ) 2025-05-07T20:33:12.9882787Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9882878Z def test_silu_mul_quant( 2025-05-07T20:33:12.9882954Z self, 2025-05-07T20:33:12.9883025Z T: int, 2025-05-07T20:33:12.9883101Z D: int, 2025-05-07T20:33:12.9883199Z scale_ub: Optional[float], 2025-05-07T20:33:12.9883285Z contiguous: bool, 2025-05-07T20:33:12.9883370Z compiled: bool, 2025-05-07T20:33:12.9883444Z ) -> None: 2025-05-07T20:33:12.9883535Z torch.manual_seed(2025) 2025-05-07T20:33:12.9883614Z 2025-05-07T20:33:12.9883778Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9883850Z 2025-05-07T20:33:12.9883944Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9884065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9884159Z x = x_sign * x_clamp 2025-05-07T20:33:12.9884244Z x0 = x[:, :D] 2025-05-07T20:33:12.9884321Z x1 = x[:, D:] 2025-05-07T20:33:12.9884398Z 2025-05-07T20:33:12.9884479Z if contiguous: 2025-05-07T20:33:12.9884566Z x0 = x0.contiguous() 2025-05-07T20:33:12.9884654Z x1 = x1.contiguous() 2025-05-07T20:33:12.9884727Z 2025-05-07T20:33:12.9884817Z if scale_ub is not None: 2025-05-07T20:33:12.9884928Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9885060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9885135Z ) 2025-05-07T20:33:12.9885216Z else: 2025-05-07T20:33:12.9885307Z scale_ub_tensor = None 2025-05-07T20:33:12.9885381Z 2025-05-07T20:33:12.9885511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9885599Z op = silu_mul_quant 2025-05-07T20:33:12.9885677Z if compiled: 2025-05-07T20:33:12.9885835Z op = torch.compile(op) 2025-05-07T20:33:12.9885936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9886015Z 2025-05-07T20:33:12.9886102Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9886106Z 2025-05-07T20:33:12.9886202Z moe/activation_test.py:117: 2025-05-07T20:33:12.9886333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9886435Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9886531Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9887019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9887112Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9887467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9887752Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9888089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9888183Z kernel = self.compile( 2025-05-07T20:33:12.9888557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9888727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9888895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9888899Z 2025-05-07T20:33:12.9889099Z self = 2025-05-07T20:33:12.9889859Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9890388Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0553fb3e20>} 2025-05-07T20:33:12.9891123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9891308Z context = 2025-05-07T20:33:12.9891315Z 2025-05-07T20:33:12.9891477Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9891736Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9891839Z module_map=module_map) 2025-05-07T20:33:12.9892002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9892098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9892172Z E ^ 2025-05-07T20:33:12.9892528Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9892532Z 2025-05-07T20:33:12.9892935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9892939Z 2025-05-07T20:33:12.9893093Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9893319Z self=, 2025-05-07T20:33:12.9893393Z T=128, 2025-05-07T20:33:12.9893474Z D=5120, 2025-05-07T20:33:12.9893555Z scale_ub=None, 2025-05-07T20:33:12.9893638Z contiguous=False, 2025-05-07T20:33:12.9893725Z compiled=False, 2025-05-07T20:33:12.9893795Z ) 2025-05-07T20:33:12.9894008Z self = 2025-05-07T20:33:12.9894174Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:12.9894179Z 2025-05-07T20:33:12.9894308Z @given( 2025-05-07T20:33:12.9894426Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9894529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9894639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9894756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9894868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9894941Z ) 2025-05-07T20:33:12.9895187Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9895277Z def test_silu_mul_quant( 2025-05-07T20:33:12.9895352Z self, 2025-05-07T20:33:12.9895426Z T: int, 2025-05-07T20:33:12.9895499Z D: int, 2025-05-07T20:33:12.9895597Z scale_ub: Optional[float], 2025-05-07T20:33:12.9895691Z contiguous: bool, 2025-05-07T20:33:12.9895775Z compiled: bool, 2025-05-07T20:33:12.9895850Z ) -> None: 2025-05-07T20:33:12.9895990Z torch.manual_seed(2025) 2025-05-07T20:33:12.9896063Z 2025-05-07T20:33:12.9896233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9896301Z 2025-05-07T20:33:12.9896390Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9896518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9896605Z x = x_sign * x_clamp 2025-05-07T20:33:12.9896682Z x0 = x[:, :D] 2025-05-07T20:33:12.9896807Z x1 = x[:, D:] 2025-05-07T20:33:12.9896874Z 2025-05-07T20:33:12.9896955Z if contiguous: 2025-05-07T20:33:12.9897052Z x0 = x0.contiguous() 2025-05-07T20:33:12.9897140Z x1 = x1.contiguous() 2025-05-07T20:33:12.9897210Z 2025-05-07T20:33:12.9897302Z if scale_ub is not None: 2025-05-07T20:33:12.9897404Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9897537Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9897608Z ) 2025-05-07T20:33:12.9897723Z else: 2025-05-07T20:33:12.9897821Z scale_ub_tensor = None 2025-05-07T20:33:12.9897892Z 2025-05-07T20:33:12.9898018Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9898111Z op = silu_mul_quant 2025-05-07T20:33:12.9898191Z if compiled: 2025-05-07T20:33:12.9898285Z op = torch.compile(op) 2025-05-07T20:33:12.9898387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9898455Z 2025-05-07T20:33:12.9898541Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9898546Z 2025-05-07T20:33:12.9898642Z moe/activation_test.py:117: 2025-05-07T20:33:12.9898766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9898871Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9898966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9899462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9899561Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:12.9899916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:12.9900150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:12.9900519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:12.9900613Z kernel = self.compile( 2025-05-07T20:33:12.9900990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:12.9901164Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:12.9901285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9901290Z 2025-05-07T20:33:12.9901495Z self = 2025-05-07T20:33:12.9902299Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:12.9902791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385c400>} 2025-05-07T20:33:12.9903521Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:12.9903710Z context = 2025-05-07T20:33:12.9903714Z 2025-05-07T20:33:12.9903876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:12.9904171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:12.9904285Z module_map=module_map) 2025-05-07T20:33:12.9904444Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:12.9904539Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:12.9904618Z E ^ 2025-05-07T20:33:12.9904962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:12.9905004Z 2025-05-07T20:33:12.9905413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:12.9905417Z 2025-05-07T20:33:12.9905516Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:12.9905734Z self=, 2025-05-07T20:33:12.9905813Z T=128, 2025-05-07T20:33:12.9905889Z D=5120, 2025-05-07T20:33:12.9905969Z scale_ub=1200.0, 2025-05-07T20:33:12.9906096Z contiguous=True, 2025-05-07T20:33:12.9906180Z compiled=False, 2025-05-07T20:33:12.9906255Z ) 2025-05-07T20:33:12.9906469Z self = 2025-05-07T20:33:12.9906636Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:12.9906641Z 2025-05-07T20:33:12.9906722Z @given( 2025-05-07T20:33:12.9906837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:12.9906935Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:12.9907049Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:12.9907163Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:12.9907272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:12.9907346Z ) 2025-05-07T20:33:12.9907585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:12.9907678Z def test_silu_mul_quant( 2025-05-07T20:33:12.9907753Z self, 2025-05-07T20:33:12.9907827Z T: int, 2025-05-07T20:33:12.9907907Z D: int, 2025-05-07T20:33:12.9908000Z scale_ub: Optional[float], 2025-05-07T20:33:12.9908085Z contiguous: bool, 2025-05-07T20:33:12.9908172Z compiled: bool, 2025-05-07T20:33:12.9908247Z ) -> None: 2025-05-07T20:33:12.9908339Z torch.manual_seed(2025) 2025-05-07T20:33:12.9908413Z 2025-05-07T20:33:12.9908580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:12.9908651Z 2025-05-07T20:33:12.9908745Z x_sign = torch.sign(x) 2025-05-07T20:33:12.9908866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:12.9908954Z x = x_sign * x_clamp 2025-05-07T20:33:12.9909032Z x0 = x[:, :D] 2025-05-07T20:33:12.9909109Z x1 = x[:, D:] 2025-05-07T20:33:12.9909185Z 2025-05-07T20:33:12.9909264Z if contiguous: 2025-05-07T20:33:12.9909349Z x0 = x0.contiguous() 2025-05-07T20:33:12.9909441Z x1 = x1.contiguous() 2025-05-07T20:33:12.9909559Z 2025-05-07T20:33:12.9909647Z if scale_ub is not None: 2025-05-07T20:33:12.9909755Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:12.9909884Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:12.9909955Z ) 2025-05-07T20:33:12.9910030Z else: 2025-05-07T20:33:12.9910126Z scale_ub_tensor = None 2025-05-07T20:33:12.9910203Z 2025-05-07T20:33:12.9910356Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:12.9910461Z op = silu_mul_quant 2025-05-07T20:33:12.9910554Z if compiled: 2025-05-07T20:33:12.9910650Z op = torch.compile(op) 2025-05-07T20:33:12.9910752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9910827Z 2025-05-07T20:33:12.9910915Z > y_fp8, y_scale = fn() 2025-05-07T20:33:12.9910920Z 2025-05-07T20:33:12.9911014Z moe/activation_test.py:117: 2025-05-07T20:33:12.9911184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:12.9911287Z moe/activation_test.py:115: in fn 2025-05-07T20:33:12.9911383Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:12.9911872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:12.9911967Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:12.9912361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9912577Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9912906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9913000Z     kernel = self.compile(
2025-05-07T20:33:12.9913376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9913588Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9913714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9913719Z 
2025-05-07T20:33:12.9917362Z self = 
2025-05-07T20:33:12.9918128Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9918629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385d300>}
2025-05-07T20:33:12.9919364Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9919555Z context = 
2025-05-07T20:33:12.9919564Z 
2025-05-07T20:33:12.9919726Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9919999Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9920126Z                            module_map=module_map)
2025-05-07T20:33:12.9920304Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9920407Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:12.9920487Z E       ^
2025-05-07T20:33:12.9920833Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9920838Z 
2025-05-07T20:33:12.9921245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9921340Z 
2025-05-07T20:33:12.9921441Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:12.9921659Z     self=,
2025-05-07T20:33:12.9921736Z     T=1,
2025-05-07T20:33:12.9921810Z     D=7168,
2025-05-07T20:33:12.9921892Z     scale_ub=1200.0,
2025-05-07T20:33:12.9921979Z     contiguous=True,
2025-05-07T20:33:12.9922062Z     compiled=True,
2025-05-07T20:33:12.9922136Z )
2025-05-07T20:33:12.9922352Z self = 
2025-05-07T20:33:12.9922511Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:12.9922516Z 
2025-05-07T20:33:12.9922595Z     @given(
2025-05-07T20:33:12.9922709Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:12.9922806Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:12.9922919Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:12.9923074Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:12.9923188Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:12.9923265Z     )
2025-05-07T20:33:12.9923504Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:12.9923594Z     def test_silu_mul_quant(
2025-05-07T20:33:12.9923670Z         self,
2025-05-07T20:33:12.9923743Z         T: int,
2025-05-07T20:33:12.9923817Z         D: int,
2025-05-07T20:33:12.9923954Z         scale_ub: Optional[float],
2025-05-07T20:33:12.9924041Z         contiguous: bool,
2025-05-07T20:33:12.9924125Z         compiled: bool,
2025-05-07T20:33:12.9924203Z     ) -> None:
2025-05-07T20:33:12.9924295Z         torch.manual_seed(2025)
2025-05-07T20:33:12.9924370Z 
2025-05-07T20:33:12.9924534Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:12.9924605Z 
2025-05-07T20:33:12.9924696Z         x_sign = torch.sign(x)
2025-05-07T20:33:12.9924821Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:12.9924990Z         x = x_sign * x_clamp
2025-05-07T20:33:12.9925075Z         x0 = x[:, :D]
2025-05-07T20:33:12.9925152Z         x1 = x[:, D:]
2025-05-07T20:33:12.9925224Z 
2025-05-07T20:33:12.9925306Z         if contiguous:
2025-05-07T20:33:12.9925392Z             x0 = x0.contiguous()
2025-05-07T20:33:12.9925479Z             x1 = x1.contiguous()
2025-05-07T20:33:12.9925547Z 
2025-05-07T20:33:12.9925638Z         if scale_ub is not None:
2025-05-07T20:33:12.9925742Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:12.9925872Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:12.9925944Z             )
2025-05-07T20:33:12.9926018Z         else:
2025-05-07T20:33:12.9926107Z             scale_ub_tensor = None
2025-05-07T20:33:12.9926177Z 
2025-05-07T20:33:12.9926310Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:12.9926397Z             op = silu_mul_quant
2025-05-07T20:33:12.9926477Z             if compiled:
2025-05-07T20:33:12.9926582Z                 op = torch.compile(op)
2025-05-07T20:33:12.9926684Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:12.9926753Z 
2025-05-07T20:33:12.9926840Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:12.9926844Z 
2025-05-07T20:33:12.9926937Z moe/activation_test.py:117: 
2025-05-07T20:33:12.9927065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9927164Z moe/activation_test.py:115: in fn
2025-05-07T20:33:12.9927257Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:12.9927621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:12.9927714Z     return fn(*args, **kwargs)
2025-05-07T20:33:12.9928195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:12.9928286Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:12.9928638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9928911Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9929241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9929330Z     kernel = self.compile(
2025-05-07T20:33:12.9929713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9929885Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9930009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9930014Z 
2025-05-07T20:33:12.9930217Z self = 
2025-05-07T20:33:12.9931015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9931518Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f055385eac0>}
2025-05-07T20:33:12.9932245Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9932472Z context = 
2025-05-07T20:33:12.9932476Z 
2025-05-07T20:33:12.9932637Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9932895Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9933068Z                            module_map=module_map)
2025-05-07T20:33:12.9933271Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9933369Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:12.9933446Z E       ^
2025-05-07T20:33:12.9933792Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9933797Z 
2025-05-07T20:33:12.9934200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9934207Z 
2025-05-07T20:33:12.9934308Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same source listing and CompilationError traceback as above: type fp8e4nv not supported in this architecture ...]
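A note on the root cause before the remaining retries: Triton's fp8e4nv is the dtype behind torch.float8_e4m3fn, and on this Triton build compiling any kernel that touches it requires compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which reports (8, 6), so the error is raised while building the kernel IR in make_ir, before anything launches; toggling compiled only changes whether torch.compile drives the call, not the outcome. A minimal capability gate for tests like this, as a sketch (sm89_or_newer and skip_unless_fp8 are illustrative names, not part of the test suite):

    import unittest
    import torch

    def sm89_or_newer() -> bool:
        # fp8e4nv / torch.float8_e4m3fn kernels need SM >= 8.9; the A10G in a
        # g5.4xlarge reports (8, 6), which is exactly why make_ir raises above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    skip_unless_fp8 = unittest.skipUnless(
        sm89_or_newer(), "fp8e4nv requires SM 8.9+ (e.g. L4, L40S, H100)"
    )

Decorating test_silu_mul_quant with skip_unless_fp8 would turn the wall of retries below into a single skip on pre-Ada runners.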
2025-05-07T20:33:12.9947042Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:12.9947258Z     self=,
2025-05-07T20:33:12.9947335Z     T=1,
2025-05-07T20:33:12.9947412Z     D=7168,
2025-05-07T20:33:12.9947490Z     scale_ub=None,
2025-05-07T20:33:12.9947575Z     contiguous=False,
2025-05-07T20:33:12.9947699Z     compiled=True,
2025-05-07T20:33:12.9947770Z )
2025-05-07T20:33:12.9947985Z self = 
2025-05-07T20:33:12.9948147Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[... test source as above; with scale_ub=None this example gets past fn() and fails in the reference path instead ...]
2025-05-07T20:33:12.9952459Z         y_fp8, y_scale = fn()
2025-05-07T20:33:12.9952579Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:12.9952648Z 
2025-05-07T20:33:12.9952785Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:12.9952882Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:12.9952979Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:12.9953101Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:12.9953236Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:12.9953308Z 
2025-05-07T20:33:12.9953413Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:12.9953418Z 
2025-05-07T20:33:12.9953509Z moe/activation_test.py:126: 
2025-05-07T20:33:12.9953683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9953787Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:12.9953916Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:12.9954465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:12.9954602Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:12.9954950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:12.9955167Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:12.9955523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:12.9955777Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:12.9956834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:12.9957002Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:12.9957339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:12.9957415Z     fn()
2025-05-07T20:33:12.9957810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:12.9957891Z     self.fn.run(
2025-05-07T20:33:12.9958221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:12.9958314Z     kernel = self.compile(
2025-05-07T20:33:12.9958682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:12.9958855Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:12.9958986Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:12.9958990Z 
2025-05-07T20:33:12.9959368Z self = 
2025-05-07T20:33:12.9960239Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:12.9960739Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552298b80>}
2025-05-07T20:33:12.9961469Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:12.9961667Z context = 
2025-05-07T20:33:12.9961759Z 
2025-05-07T20:33:12.9961942Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:12.9962249Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:12.9962360Z                            module_map=module_map)
2025-05-07T20:33:12.9962540Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:12.9962654Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:12.9962729Z E       ^
2025-05-07T20:33:12.9963152Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:12.9963156Z 
2025-05-07T20:33:12.9963647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:12.9963651Z 
2025-05-07T20:33:12.9963757Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... same source listing and CompilationError traceback in _fbgemm_silu_mul_quant ...]
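The scale_ub=None example above exposes a second casualty: the reference path itself calls triton_quantize_fp8_row, which launches another Triton kernel (_kernel_quantize_fp8_row), so even the expected values cannot be computed on this GPU. A Triton-free reference would still run here, since casting to torch.float8_e4m3fn is a software conversion in PyTorch. A sketch under assumed semantics (per-row scale = row max / fp8 max, optionally capped by scale_ub; the real _kernel_quantize_fp8_row may differ in eps and clamping details, and quantize_fp8_row_ref is an illustrative name):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max -> per-row scale, mirroring a row-wise quantizer.
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        # Dequantize as the test does: y_fp8.to(torch.float32) * scale[:, None]
        return y_fp8, scale.squeeze(-1)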
2025-05-07T20:33:12.9977571Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:33:12.9989886Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:33:13.0002572Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
2025-05-07T20:33:13.0015459Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError in _fbgemm_silu_mul_quant ...]
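Every retry, eager or compiled, small T or large, dies at the same IR-construction check, so the parameter sweep adds no new information. The failure also reproduces without FBGEMM or Hypothesis at all; a hypothetical one-kernel repro on a pre-SM-8.9 device with this Triton build (the kernel and names below are illustrative, not FBGEMM code):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 compilation stops here with the same ValueError:
        # "type fp8e4nv not supported in this architecture."
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)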
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0022211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0022426Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0022759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0022856Z kernel = self.compile( 2025-05-07T20:33:13.0023226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0023402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0023523Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0023528Z 2025-05-07T20:33:13.0023731Z self = 2025-05-07T20:33:13.0024488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0024981Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1ff1e40>} 2025-05-07T20:33:13.0025759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0025946Z context = 2025-05-07T20:33:13.0025950Z 2025-05-07T20:33:13.0026113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0026368Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0026476Z module_map=module_map) 2025-05-07T20:33:13.0026633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0026729Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0026808Z E ^ 2025-05-07T20:33:13.0027153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0027200Z 2025-05-07T20:33:13.0027607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0027616Z 2025-05-07T20:33:13.0027714Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0027931Z self=, 2025-05-07T20:33:13.0028005Z T=4096, 2025-05-07T20:33:13.0028117Z D=7168, 2025-05-07T20:33:13.0028197Z scale_ub=1200.0, 2025-05-07T20:33:13.0028283Z contiguous=False, 2025-05-07T20:33:13.0028361Z compiled=False, 2025-05-07T20:33:13.0028431Z ) 2025-05-07T20:33:13.0028645Z self = 2025-05-07T20:33:13.0028814Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:13.0028818Z 2025-05-07T20:33:13.0028892Z @given( 2025-05-07T20:33:13.0029008Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0029145Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0029260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0029373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0029485Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0029558Z ) 2025-05-07T20:33:13.0029796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0029887Z def test_silu_mul_quant( 2025-05-07T20:33:13.0029965Z self, 2025-05-07T20:33:13.0030038Z T: int, 2025-05-07T20:33:13.0030112Z D: int, 2025-05-07T20:33:13.0030213Z scale_ub: Optional[float], 2025-05-07T20:33:13.0030300Z contiguous: bool, 2025-05-07T20:33:13.0030384Z compiled: bool, 2025-05-07T20:33:13.0030459Z ) -> None: 2025-05-07T20:33:13.0030550Z torch.manual_seed(2025) 2025-05-07T20:33:13.0030622Z 2025-05-07T20:33:13.0030788Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0030866Z 2025-05-07T20:33:13.0030961Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0031084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0031168Z x = x_sign * x_clamp 2025-05-07T20:33:13.0031246Z x0 = x[:, :D] 2025-05-07T20:33:13.0031322Z x1 = x[:, D:] 2025-05-07T20:33:13.0031392Z 2025-05-07T20:33:13.0031473Z if contiguous: 2025-05-07T20:33:13.0031561Z x0 = x0.contiguous() 2025-05-07T20:33:13.0031650Z x1 = x1.contiguous() 2025-05-07T20:33:13.0031722Z 2025-05-07T20:33:13.0031808Z if scale_ub is not None: 2025-05-07T20:33:13.0031913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0032044Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0032115Z ) 2025-05-07T20:33:13.0032188Z else: 2025-05-07T20:33:13.0032279Z scale_ub_tensor = None 2025-05-07T20:33:13.0032352Z 2025-05-07T20:33:13.0032531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0032619Z op = silu_mul_quant 2025-05-07T20:33:13.0032698Z if compiled: 2025-05-07T20:33:13.0032798Z op = torch.compile(op) 2025-05-07T20:33:13.0032897Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0032967Z 2025-05-07T20:33:13.0033055Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0033062Z 2025-05-07T20:33:13.0033156Z moe/activation_test.py:117: 2025-05-07T20:33:13.0033283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0033378Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0033471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0033961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:13.0034054Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0034446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0034675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0035007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0035098Z kernel = self.compile( 2025-05-07T20:33:13.0035471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0038843Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0038974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0038980Z 2025-05-07T20:33:13.0039185Z self = 2025-05-07T20:33:13.0040012Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0040560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1ff3380>} 2025-05-07T20:33:13.0041294Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0041484Z context = 2025-05-07T20:33:13.0041488Z 2025-05-07T20:33:13.0041651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0041905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0042013Z module_map=module_map) 2025-05-07T20:33:13.0042178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0042275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0042353Z E ^ 2025-05-07T20:33:13.0042700Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0042705Z 2025-05-07T20:33:13.0043108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0043115Z 2025-05-07T20:33:13.0043222Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0043436Z self=, 2025-05-07T20:33:13.0043515Z T=16384, 2025-05-07T20:33:13.0043589Z D=7168, 2025-05-07T20:33:13.0043669Z scale_ub=None, 2025-05-07T20:33:13.0043754Z contiguous=True, 2025-05-07T20:33:13.0043834Z compiled=True, 2025-05-07T20:33:13.0043901Z ) 2025-05-07T20:33:13.0044120Z self = 2025-05-07T20:33:13.0044355Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:13.0044360Z 2025-05-07T20:33:13.0044434Z @given( 2025-05-07T20:33:13.0044551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0044645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0044755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0044875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0044983Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0045065Z ) 2025-05-07T20:33:13.0045305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0045393Z def test_silu_mul_quant( 2025-05-07T20:33:13.0045470Z self, 2025-05-07T20:33:13.0045545Z T: int, 2025-05-07T20:33:13.0045616Z D: int, 2025-05-07T20:33:13.0045755Z scale_ub: Optional[float], 2025-05-07T20:33:13.0045848Z contiguous: bool, 2025-05-07T20:33:13.0045928Z compiled: bool, 2025-05-07T20:33:13.0046012Z ) -> None: 2025-05-07T20:33:13.0046103Z torch.manual_seed(2025) 2025-05-07T20:33:13.0046171Z 2025-05-07T20:33:13.0046337Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0046408Z 2025-05-07T20:33:13.0046498Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0046662Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0046745Z x = x_sign * x_clamp 2025-05-07T20:33:13.0046825Z x0 = x[:, :D] 2025-05-07T20:33:13.0046901Z x1 = x[:, D:] 2025-05-07T20:33:13.0046972Z 2025-05-07T20:33:13.0047052Z if contiguous: 2025-05-07T20:33:13.0047138Z x0 = x0.contiguous() 2025-05-07T20:33:13.0047226Z x1 = x1.contiguous() 2025-05-07T20:33:13.0047298Z 2025-05-07T20:33:13.0047383Z if scale_ub is not None: 2025-05-07T20:33:13.0047530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0047666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0047739Z ) 2025-05-07T20:33:13.0047816Z else: 2025-05-07T20:33:13.0047907Z scale_ub_tensor = None 2025-05-07T20:33:13.0047978Z 2025-05-07T20:33:13.0048105Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0048193Z op = silu_mul_quant 2025-05-07T20:33:13.0048272Z if compiled: 2025-05-07T20:33:13.0048371Z op = torch.compile(op) 2025-05-07T20:33:13.0048472Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0048541Z 2025-05-07T20:33:13.0048629Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0048634Z 2025-05-07T20:33:13.0048724Z moe/activation_test.py:117: 2025-05-07T20:33:13.0048850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0048947Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0049052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0049414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0049503Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = ..., T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = ...
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
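The failure is environment-specific rather than a bug in the kernel or the test: Triton's fp8e4nv is the NVIDIA float8 e4m3 format, which the CUDA backend accepts only on GPUs of compute capability 8.9 or newer, while older architectures expose just fp8e4b15 and fp8e5, exactly the list in the ValueError, so the GPU on this runner is evidently below 8.9. A capability gate would let the suite skip these examples instead of failing the whole job. The following is a minimal sketch, not FBGEMM's actual gating; supports_fp8e4nv and the test class name are hypothetical, and the (8, 9) floor reflects NVIDIA's e4m3 support:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Best-effort check: fp8e4nv (float8 e4m3) kernels compile only on
        # NVIDIA GPUs with compute capability >= 8.9; older parts raise the
        # ValueError above at Triton compile time.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class Fp8ActivationTest(unittest.TestCase):
        def test_silu_mul_quant_smoke(self) -> None:
            # Placeholder body; the real coverage is the Hypothesis-driven
            # test shown above, which this gate would protect.
            self.assertTrue(supports_fp8e4nv())

With such a gate the job would report these cases as skipped on unsupported hardware rather than burning through every Hypothesis example, as happens below.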
Hypothesis went on to retry the test with eleven further examples. Each one re-ran the test body shown above and failed at the same call site (activation.py:80 -> triton jit.py:330/623 -> compiler.py:273 -> compiler.py:100) with the identical CompilationError; only the sampled parameters differ:

Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True
Trying example: T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False
Trying example: T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
Trying example: T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
Trying example: T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True

In each case:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0210867Z 2025-05-07T20:33:13.0211270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0211277Z 2025-05-07T20:33:13.0211377Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0211596Z self=, 2025-05-07T20:33:13.0211667Z T=2048, 2025-05-07T20:33:13.0211739Z D=5120, 2025-05-07T20:33:13.0211820Z scale_ub=None, 2025-05-07T20:33:13.0211908Z contiguous=False, 2025-05-07T20:33:13.0211990Z compiled=True, 2025-05-07T20:33:13.0212059Z ) 2025-05-07T20:33:13.0212276Z self = 2025-05-07T20:33:13.0212495Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:13.0212499Z 2025-05-07T20:33:13.0212576Z @given( 2025-05-07T20:33:13.0212690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0212789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0212899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0213092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0213204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0213272Z ) 2025-05-07T20:33:13.0213515Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0213604Z def test_silu_mul_quant( 2025-05-07T20:33:13.0213676Z self, 2025-05-07T20:33:13.0213752Z T: int, 2025-05-07T20:33:13.0213823Z D: int, 2025-05-07T20:33:13.0213918Z scale_ub: Optional[float], 2025-05-07T20:33:13.0214049Z contiguous: bool, 2025-05-07T20:33:13.0214139Z compiled: bool, 2025-05-07T20:33:13.0214214Z ) -> None: 2025-05-07T20:33:13.0214307Z torch.manual_seed(2025) 2025-05-07T20:33:13.0214378Z 2025-05-07T20:33:13.0214542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0214613Z 2025-05-07T20:33:13.0214701Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0214869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0214956Z x = x_sign * x_clamp 2025-05-07T20:33:13.0215031Z x0 = x[:, :D] 2025-05-07T20:33:13.0215112Z x1 = x[:, D:] 2025-05-07T20:33:13.0215179Z 2025-05-07T20:33:13.0215259Z if contiguous: 2025-05-07T20:33:13.0215348Z x0 = x0.contiguous() 2025-05-07T20:33:13.0215434Z x1 = x1.contiguous() 2025-05-07T20:33:13.0215503Z 2025-05-07T20:33:13.0215590Z if scale_ub is not None: 2025-05-07T20:33:13.0215699Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0215873Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0215948Z ) 2025-05-07T20:33:13.0216019Z else: 2025-05-07T20:33:13.0216111Z scale_ub_tensor = None 2025-05-07T20:33:13.0216186Z 2025-05-07T20:33:13.0216312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0216398Z op = silu_mul_quant 2025-05-07T20:33:13.0216486Z if compiled: 2025-05-07T20:33:13.0216582Z op = torch.compile(op) 2025-05-07T20:33:13.0216685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0216751Z 2025-05-07T20:33:13.0216836Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0216841Z 2025-05-07T20:33:13.0216935Z moe/activation_test.py:117: 2025-05-07T20:33:13.0217058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0217153Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0217259Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0217620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0217707Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0218190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0218282Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0218633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0218848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0219178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0219272Z kernel = self.compile( 2025-05-07T20:33:13.0219651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0219867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0219989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0219993Z 2025-05-07T20:33:13.0220191Z self = 2025-05-07T20:33:13.0220953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0221447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1c019e0>} 2025-05-07T20:33:13.0222226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0222413Z context = 2025-05-07T20:33:13.0222418Z 2025-05-07T20:33:13.0222577Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0222835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0223001Z module_map=module_map) 2025-05-07T20:33:13.0223161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0223255Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0223328Z E ^ 2025-05-07T20:33:13.0223677Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0223682Z 2025-05-07T20:33:13.0224084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0224134Z 2025-05-07T20:33:13.0224236Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0224453Z self=, 2025-05-07T20:33:13.0224526Z T=2048, 2025-05-07T20:33:13.0224602Z D=5120, 2025-05-07T20:33:13.0224681Z scale_ub=1200.0, 2025-05-07T20:33:13.0224763Z contiguous=False, 2025-05-07T20:33:13.0224844Z compiled=True, 2025-05-07T20:33:13.0224915Z ) 2025-05-07T20:33:13.0225127Z self = 2025-05-07T20:33:13.0225297Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0225302Z 2025-05-07T20:33:13.0225377Z @given( 2025-05-07T20:33:13.0225499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0225592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0225704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0225823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0225934Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0226005Z ) 2025-05-07T20:33:13.0226245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0226331Z def test_silu_mul_quant( 2025-05-07T20:33:13.0226403Z self, 2025-05-07T20:33:13.0226478Z T: int, 2025-05-07T20:33:13.0226556Z D: int, 2025-05-07T20:33:13.0226653Z scale_ub: Optional[float], 2025-05-07T20:33:13.0226740Z contiguous: bool, 2025-05-07T20:33:13.0226822Z compiled: bool, 2025-05-07T20:33:13.0226899Z ) -> None: 2025-05-07T20:33:13.0226989Z torch.manual_seed(2025) 2025-05-07T20:33:13.0227060Z 2025-05-07T20:33:13.0227225Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0227297Z 2025-05-07T20:33:13.0227382Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0227512Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0227643Z x = x_sign * x_clamp 2025-05-07T20:33:13.0227718Z x0 = x[:, :D] 2025-05-07T20:33:13.0227801Z x1 = x[:, D:] 2025-05-07T20:33:13.0227868Z 2025-05-07T20:33:13.0227947Z if contiguous: 2025-05-07T20:33:13.0228035Z x0 = x0.contiguous() 2025-05-07T20:33:13.0228119Z x1 = x1.contiguous() 2025-05-07T20:33:13.0228190Z 2025-05-07T20:33:13.0228278Z if scale_ub is not None: 2025-05-07T20:33:13.0228380Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0228512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0228582Z ) 2025-05-07T20:33:13.0228654Z else: 2025-05-07T20:33:13.0228749Z scale_ub_tensor = None 2025-05-07T20:33:13.0228818Z 2025-05-07T20:33:13.0228941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0229028Z op = silu_mul_quant 2025-05-07T20:33:13.0229151Z if compiled: 2025-05-07T20:33:13.0229252Z op = torch.compile(op) 2025-05-07T20:33:13.0229358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0229426Z 2025-05-07T20:33:13.0229514Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0229519Z 2025-05-07T20:33:13.0229611Z moe/activation_test.py:117: 2025-05-07T20:33:13.0229734Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0229872Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0229971Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0230375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0230468Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0230949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0231047Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0231443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0231661Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0231998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0232087Z kernel = self.compile( 2025-05-07T20:33:13.0232464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0232637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0232759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0232763Z 2025-05-07T20:33:13.0232965Z self = 2025-05-07T20:33:13.0233724Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0234223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1c02b60>} 2025-05-07T20:33:13.0234950Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0235137Z context = 2025-05-07T20:33:13.0235142Z 2025-05-07T20:33:13.0235304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0235555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0235662Z module_map=module_map) 2025-05-07T20:33:13.0235864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0235957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0236035Z E ^ 2025-05-07T20:33:13.0236380Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0236385Z 2025-05-07T20:33:13.0236786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0236796Z 2025-05-07T20:33:13.0236892Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0237108Z self=, 2025-05-07T20:33:13.0237185Z T=4096, 2025-05-07T20:33:13.0237255Z D=5120, 2025-05-07T20:33:13.0237334Z scale_ub=1200.0, 2025-05-07T20:33:13.0237420Z contiguous=True, 2025-05-07T20:33:13.0237500Z compiled=True, 2025-05-07T20:33:13.0237569Z ) 2025-05-07T20:33:13.0237832Z self = 2025-05-07T20:33:13.0237998Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:13.0238002Z 2025-05-07T20:33:13.0238077Z @given( 2025-05-07T20:33:13.0238195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0238291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0238448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0238559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0238670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0238741Z ) 2025-05-07T20:33:13.0238978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0239068Z def test_silu_mul_quant( 2025-05-07T20:33:13.0239146Z self, 2025-05-07T20:33:13.0239217Z T: int, 2025-05-07T20:33:13.0239291Z D: int, 2025-05-07T20:33:13.0239430Z scale_ub: Optional[float], 2025-05-07T20:33:13.0239520Z contiguous: bool, 2025-05-07T20:33:13.0239603Z compiled: bool, 2025-05-07T20:33:13.0239676Z ) -> None: 2025-05-07T20:33:13.0239768Z torch.manual_seed(2025) 2025-05-07T20:33:13.0239840Z 2025-05-07T20:33:13.0240005Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0240072Z 2025-05-07T20:33:13.0240166Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0240291Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0240377Z x = x_sign * x_clamp 2025-05-07T20:33:13.0240470Z x0 = x[:, :D] 2025-05-07T20:33:13.0240558Z x1 = x[:, D:] 2025-05-07T20:33:13.0240640Z 2025-05-07T20:33:13.0240734Z if contiguous: 2025-05-07T20:33:13.0240822Z x0 = x0.contiguous() 2025-05-07T20:33:13.0240905Z x1 = x1.contiguous() 2025-05-07T20:33:13.0240973Z 2025-05-07T20:33:13.0241061Z if scale_ub is not None: 2025-05-07T20:33:13.0241170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0241299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0241368Z ) 2025-05-07T20:33:13.0241446Z else: 2025-05-07T20:33:13.0241535Z scale_ub_tensor = None 2025-05-07T20:33:13.0241604Z 2025-05-07T20:33:13.0241734Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0241824Z op = silu_mul_quant 2025-05-07T20:33:13.0241903Z if compiled: 2025-05-07T20:33:13.0242004Z op = torch.compile(op) 2025-05-07T20:33:13.0242105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0242174Z 2025-05-07T20:33:13.0242263Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0242267Z 2025-05-07T20:33:13.0242358Z moe/activation_test.py:117: 2025-05-07T20:33:13.0242487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0242588Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0242733Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0243095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0243183Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0243662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0243763Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0244110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0244328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0244657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0244744Z kernel = self.compile( 2025-05-07T20:33:13.0245160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0245333Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0245459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0245464Z 2025-05-07T20:33:13.0245663Z self = 2025-05-07T20:33:13.0246463Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0246957Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1978180>} 2025-05-07T20:33:13.0247726Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0247916Z context = 2025-05-07T20:33:13.0247921Z 2025-05-07T20:33:13.0248078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0248331Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0248440Z module_map=module_map) 2025-05-07T20:33:13.0248595Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0248695Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0248767Z E ^ 2025-05-07T20:33:13.0249111Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0249115Z 2025-05-07T20:33:13.0249525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0249532Z 2025-05-07T20:33:13.0249632Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0249851Z self=, 2025-05-07T20:33:13.0249926Z T=128, 2025-05-07T20:33:13.0249998Z D=5120, 2025-05-07T20:33:13.0250086Z scale_ub=1200.0, 2025-05-07T20:33:13.0250188Z contiguous=False, 2025-05-07T20:33:13.0250276Z compiled=True, 2025-05-07T20:33:13.0250361Z ) 2025-05-07T20:33:13.0250573Z self = 2025-05-07T20:33:13.0250738Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0250742Z 2025-05-07T20:33:13.0250820Z @given( 2025-05-07T20:33:13.0250934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0251028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0251142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0251303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0251417Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0251491Z ) 2025-05-07T20:33:13.0251730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0251827Z def test_silu_mul_quant( 2025-05-07T20:33:13.0251903Z self, 2025-05-07T20:33:13.0251979Z T: int, 2025-05-07T20:33:13.0252056Z D: int, 2025-05-07T20:33:13.0252153Z scale_ub: Optional[float], 2025-05-07T20:33:13.0252237Z contiguous: bool, 2025-05-07T20:33:13.0252320Z compiled: bool, 2025-05-07T20:33:13.0252396Z ) -> None: 2025-05-07T20:33:13.0252489Z torch.manual_seed(2025) 2025-05-07T20:33:13.0252564Z 2025-05-07T20:33:13.0252727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0252798Z 2025-05-07T20:33:13.0252954Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0253139Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0253227Z x = x_sign * x_clamp 2025-05-07T20:33:13.0253304Z x0 = x[:, :D] 2025-05-07T20:33:13.0253381Z x1 = x[:, D:] 2025-05-07T20:33:13.0253450Z 2025-05-07T20:33:13.0253532Z if contiguous: 2025-05-07T20:33:13.0253619Z x0 = x0.contiguous() 2025-05-07T20:33:13.0253709Z x1 = x1.contiguous() 2025-05-07T20:33:13.0253824Z 2025-05-07T20:33:13.0253910Z if scale_ub is not None: 2025-05-07T20:33:13.0254014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0254141Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0254220Z ) 2025-05-07T20:33:13.0254291Z else: 2025-05-07T20:33:13.0254382Z scale_ub_tensor = None 2025-05-07T20:33:13.0254454Z 2025-05-07T20:33:13.0254580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0254713Z op = silu_mul_quant 2025-05-07T20:33:13.0254799Z if compiled: 2025-05-07T20:33:13.0254896Z op = torch.compile(op) 2025-05-07T20:33:13.0254996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0255067Z 2025-05-07T20:33:13.0255154Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0255158Z 2025-05-07T20:33:13.0255249Z moe/activation_test.py:117: 2025-05-07T20:33:13.0255379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0255475Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0255573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0255933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0256019Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0256503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0256602Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0256948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0257168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0257497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0257593Z kernel = self.compile( 2025-05-07T20:33:13.0257964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0258133Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0258257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0258261Z 2025-05-07T20:33:13.0258458Z self = 2025-05-07T20:33:13.0259400Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0260060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1978ea0>} 2025-05-07T20:33:13.0260839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0261025Z context = 2025-05-07T20:33:13.0261030Z 2025-05-07T20:33:13.0261191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0261519Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0261629Z module_map=module_map) 2025-05-07T20:33:13.0261785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0261882Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0261956Z E ^ 2025-05-07T20:33:13.0262302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0262367Z 2025-05-07T20:33:13.0262770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0262775Z 2025-05-07T20:33:13.0262874Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0263093Z self=, 2025-05-07T20:33:13.0263164Z T=16384, 2025-05-07T20:33:13.0263240Z D=7168, 2025-05-07T20:33:13.0263319Z scale_ub=1200.0, 2025-05-07T20:33:13.0263398Z contiguous=True, 2025-05-07T20:33:13.0263543Z compiled=True, 2025-05-07T20:33:13.0263617Z ) 2025-05-07T20:33:13.0263829Z self = 2025-05-07T20:33:13.0264000Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:13.0264005Z 2025-05-07T20:33:13.0264077Z @given( 2025-05-07T20:33:13.0264192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0264295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0264404Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0264519Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0264627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0264701Z ) 2025-05-07T20:33:13.0264946Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0265033Z def test_silu_mul_quant( 2025-05-07T20:33:13.0265107Z self, 2025-05-07T20:33:13.0265186Z T: int, 2025-05-07T20:33:13.0265262Z D: int, 2025-05-07T20:33:13.0265358Z scale_ub: Optional[float], 2025-05-07T20:33:13.0265451Z contiguous: bool, 2025-05-07T20:33:13.0265532Z compiled: bool, 2025-05-07T20:33:13.0265605Z ) -> None: 2025-05-07T20:33:13.0265700Z torch.manual_seed(2025) 2025-05-07T20:33:13.0265772Z 2025-05-07T20:33:13.0265935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0266013Z 2025-05-07T20:33:13.0266098Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0266221Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0266304Z x = x_sign * x_clamp 2025-05-07T20:33:13.0266381Z x0 = x[:, :D] 2025-05-07T20:33:13.0266459Z x1 = x[:, D:] 2025-05-07T20:33:13.0266528Z 2025-05-07T20:33:13.0266606Z if contiguous: 2025-05-07T20:33:13.0266693Z x0 = x0.contiguous() 2025-05-07T20:33:13.0266777Z x1 = x1.contiguous() 2025-05-07T20:33:13.0266903Z 2025-05-07T20:33:13.0266993Z if scale_ub is not None: 2025-05-07T20:33:13.0267095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0267224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0267300Z ) 2025-05-07T20:33:13.0267374Z else: 2025-05-07T20:33:13.0267469Z scale_ub_tensor = None 2025-05-07T20:33:13.0267535Z 2025-05-07T20:33:13.0267661Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0267752Z op = silu_mul_quant 2025-05-07T20:33:13.0267833Z if compiled: 2025-05-07T20:33:13.0267927Z op = torch.compile(op) 2025-05-07T20:33:13.0268031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0268101Z 2025-05-07T20:33:13.0268187Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0268191Z 2025-05-07T20:33:13.0268287Z moe/activation_test.py:117: 2025-05-07T20:33:13.0268456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0268560Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0268654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0269018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0269108Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0269591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0269724Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0270074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0270290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0270626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0270757Z kernel = self.compile( 2025-05-07T20:33:13.0271133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0271304Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0271426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0271431Z 2025-05-07T20:33:13.0271635Z self = 2025-05-07T20:33:13.0272391Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0272883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a197a0c0>} 2025-05-07T20:33:13.0273619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0273806Z context = 2025-05-07T20:33:13.0273810Z 2025-05-07T20:33:13.0273971Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0274227Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0274329Z module_map=module_map) 2025-05-07T20:33:13.0274486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0274579Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0274650Z E ^ 2025-05-07T20:33:13.0274997Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0275001Z 2025-05-07T20:33:13.0275450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0275454Z 2025-05-07T20:33:13.0275555Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0275771Z self=, 2025-05-07T20:33:13.0275842Z T=16384, 2025-05-07T20:33:13.0275915Z D=5120, 2025-05-07T20:33:13.0275999Z scale_ub=1200.0, 2025-05-07T20:33:13.0276078Z contiguous=True, 2025-05-07T20:33:13.0276165Z compiled=False, 2025-05-07T20:33:13.0276233Z ) 2025-05-07T20:33:13.0279580Z self = 2025-05-07T20:33:13.0279775Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:13.0279780Z 2025-05-07T20:33:13.0279859Z @given( 2025-05-07T20:33:13.0279974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0280138Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0280279Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0280402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0280529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0280601Z ) 2025-05-07T20:33:13.0280842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0280980Z def test_silu_mul_quant( 2025-05-07T20:33:13.0281052Z self, 2025-05-07T20:33:13.0281123Z T: int, 2025-05-07T20:33:13.0281198Z D: int, 2025-05-07T20:33:13.0281294Z scale_ub: Optional[float], 2025-05-07T20:33:13.0281381Z contiguous: bool, 2025-05-07T20:33:13.0281468Z compiled: bool, 2025-05-07T20:33:13.0281545Z ) -> None: 2025-05-07T20:33:13.0281637Z torch.manual_seed(2025) 2025-05-07T20:33:13.0281709Z 2025-05-07T20:33:13.0281876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0281959Z 2025-05-07T20:33:13.0282092Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0282215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0282301Z x = x_sign * x_clamp 2025-05-07T20:33:13.0282377Z x0 = x[:, :D] 2025-05-07T20:33:13.0282450Z x1 = x[:, D:] 2025-05-07T20:33:13.0282524Z 2025-05-07T20:33:13.0282603Z if contiguous: 2025-05-07T20:33:13.0282692Z x0 = x0.contiguous() 2025-05-07T20:33:13.0282780Z x1 = x1.contiguous() 2025-05-07T20:33:13.0282848Z 2025-05-07T20:33:13.0282936Z if scale_ub is not None: 2025-05-07T20:33:13.0283040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0283171Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0283242Z ) 2025-05-07T20:33:13.0283316Z else: 2025-05-07T20:33:13.0283406Z scale_ub_tensor = None 2025-05-07T20:33:13.0283478Z 2025-05-07T20:33:13.0283608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0283698Z op = silu_mul_quant 2025-05-07T20:33:13.0283782Z if compiled: 2025-05-07T20:33:13.0283877Z op = torch.compile(op) 2025-05-07T20:33:13.0283976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0284048Z 2025-05-07T20:33:13.0284134Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0284139Z 2025-05-07T20:33:13.0284235Z moe/activation_test.py:117: 2025-05-07T20:33:13.0284363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0284461Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0284559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0285048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:13.0285143Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0285499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0285786Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0286121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0286213Z kernel = self.compile( 2025-05-07T20:33:13.0286585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0286760Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0286883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0286887Z 2025-05-07T20:33:13.0287086Z self = 2025-05-07T20:33:13.0287890Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0288385Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f03a1979a80>} 2025-05-07T20:33:13.0289114Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0289361Z context = 2025-05-07T20:33:13.0289366Z 2025-05-07T20:33:13.0289528Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0289781Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0289884Z module_map=module_map) 2025-05-07T20:33:13.0290088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0290187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0290264Z E ^ 2025-05-07T20:33:13.0290657Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0290663Z 2025-05-07T20:33:13.0291068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0291079Z 2025-05-07T20:33:13.0291181Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0291397Z self=, 2025-05-07T20:33:13.0291469Z T=1, 2025-05-07T20:33:13.0291541Z D=7168, 2025-05-07T20:33:13.0291619Z scale_ub=1200.0, 2025-05-07T20:33:13.0291704Z contiguous=False, 2025-05-07T20:33:13.0291784Z compiled=False, 2025-05-07T20:33:13.0291855Z ) 2025-05-07T20:33:13.0292072Z self = 2025-05-07T20:33:13.0292241Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:13.0292245Z 2025-05-07T20:33:13.0292318Z @given( 2025-05-07T20:33:13.0292440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0292536Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0292649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0292769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0292879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0292951Z ) 2025-05-07T20:33:13.0293304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0293394Z def test_silu_mul_quant( 2025-05-07T20:33:13.0293468Z self, 2025-05-07T20:33:13.0293539Z T: int, 2025-05-07T20:33:13.0293610Z D: int, 2025-05-07T20:33:13.0293709Z scale_ub: Optional[float], 2025-05-07T20:33:13.0293848Z contiguous: bool, 2025-05-07T20:33:13.0293966Z compiled: bool, 2025-05-07T20:33:13.0294075Z ) -> None: 2025-05-07T20:33:13.0294198Z torch.manual_seed(2025) 2025-05-07T20:33:13.0294290Z 2025-05-07T20:33:13.0294518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0294618Z 2025-05-07T20:33:13.0294718Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0294849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0294932Z x = x_sign * x_clamp 2025-05-07T20:33:13.0295016Z x0 = x[:, :D] 2025-05-07T20:33:13.0295095Z x1 = x[:, D:] 2025-05-07T20:33:13.0295174Z 2025-05-07T20:33:13.0295261Z if contiguous: 2025-05-07T20:33:13.0295378Z x0 = x0.contiguous() 2025-05-07T20:33:13.0295497Z x1 = x1.contiguous() 2025-05-07T20:33:13.0295594Z 2025-05-07T20:33:13.0295681Z if scale_ub is not None: 2025-05-07T20:33:13.0295849Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0295989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0296062Z ) 2025-05-07T20:33:13.0296133Z else: 2025-05-07T20:33:13.0296226Z scale_ub_tensor = None 2025-05-07T20:33:13.0296297Z 2025-05-07T20:33:13.0296423Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0296510Z op = silu_mul_quant 2025-05-07T20:33:13.0296637Z if compiled: 2025-05-07T20:33:13.0296734Z op = torch.compile(op) 2025-05-07T20:33:13.0296834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0296903Z 2025-05-07T20:33:13.0296993Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0296998Z 2025-05-07T20:33:13.0297089Z moe/activation_test.py:117: 2025-05-07T20:33:13.0297213Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0297315Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0297453Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0297944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0298040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0298391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0298617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0298946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0299036Z kernel = self.compile( 2025-05-07T20:33:13.0299410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0299580Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0299712Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0299718Z 2025-05-07T20:33:13.0299917Z self = 2025-05-07T20:33:13.0300680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0301177Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552a400e0>} 2025-05-07T20:33:13.0301903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0302090Z context = 2025-05-07T20:33:13.0302138Z 2025-05-07T20:33:13.0302305Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0302557Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0302664Z module_map=module_map) 2025-05-07T20:33:13.0302823Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0302921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0302998Z E ^ 2025-05-07T20:33:13.0303341Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0303345Z 2025-05-07T20:33:13.0303750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0303755Z 2025-05-07T20:33:13.0303853Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0304112Z self=, 2025-05-07T20:33:13.0304192Z T=4096, 2025-05-07T20:33:13.0304264Z D=7168, 2025-05-07T20:33:13.0304348Z scale_ub=1200.0, 2025-05-07T20:33:13.0304431Z contiguous=False, 2025-05-07T20:33:13.0304509Z compiled=True, 2025-05-07T20:33:13.0304579Z ) 2025-05-07T20:33:13.0304789Z self = 2025-05-07T20:33:13.0304959Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0305005Z 2025-05-07T20:33:13.0305086Z @given( 2025-05-07T20:33:13.0305201Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0305301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0305411Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0305523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0305635Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0305705Z ) 2025-05-07T20:33:13.0305986Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0306082Z def test_silu_mul_quant( 2025-05-07T20:33:13.0306158Z self, 2025-05-07T20:33:13.0306232Z T: int, 2025-05-07T20:33:13.0306310Z D: int, 2025-05-07T20:33:13.0306407Z scale_ub: Optional[float], 2025-05-07T20:33:13.0306492Z contiguous: bool, 2025-05-07T20:33:13.0306576Z compiled: bool, 2025-05-07T20:33:13.0306654Z ) -> None: 2025-05-07T20:33:13.0306747Z torch.manual_seed(2025) 2025-05-07T20:33:13.0306814Z 2025-05-07T20:33:13.0306977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0307051Z 2025-05-07T20:33:13.0307139Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0307262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0307353Z x = x_sign * x_clamp 2025-05-07T20:33:13.0307431Z x0 = x[:, :D] 2025-05-07T20:33:13.0307508Z x1 = x[:, D:] 2025-05-07T20:33:13.0307583Z 2025-05-07T20:33:13.0307663Z if contiguous: 2025-05-07T20:33:13.0307749Z x0 = x0.contiguous() 2025-05-07T20:33:13.0307836Z x1 = x1.contiguous() 2025-05-07T20:33:13.0307906Z 2025-05-07T20:33:13.0307993Z if scale_ub is not None: 2025-05-07T20:33:13.0308099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0308228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0308307Z ) 2025-05-07T20:33:13.0308379Z else: 2025-05-07T20:33:13.0308470Z scale_ub_tensor = None 2025-05-07T20:33:13.0308543Z 2025-05-07T20:33:13.0308666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0308752Z op = silu_mul_quant 2025-05-07T20:33:13.0308837Z if compiled: 2025-05-07T20:33:13.0308934Z op = torch.compile(op) 2025-05-07T20:33:13.0309032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0309103Z 2025-05-07T20:33:13.0309315Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0309320Z 2025-05-07T20:33:13.0309415Z moe/activation_test.py:117: 2025-05-07T20:33:13.0309538Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0309634Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0309732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0310094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0310193Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0310719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:13.0310812Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:13.0311166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:13.0311425Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:13.0311762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:13.0311856Z kernel = self.compile( 2025-05-07T20:33:13.0312225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:13.0312393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:13.0312564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0312569Z 2025-05-07T20:33:13.0312766Z self = 2025-05-07T20:33:13.0313523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:13.0314081Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0552a41300>} 2025-05-07T20:33:13.0314812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:13.0314999Z context = 2025-05-07T20:33:13.0315004Z 2025-05-07T20:33:13.0315164Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:13.0315421Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:13.0315523Z module_map=module_map) 2025-05-07T20:33:13.0315683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:13.0315777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:13.0315858Z E ^ 2025-05-07T20:33:13.0316206Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:13.0316211Z 2025-05-07T20:33:13.0316612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:13.0316616Z 2025-05-07T20:33:13.0316713Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0316936Z self=, 2025-05-07T20:33:13.0317008Z T=128, 2025-05-07T20:33:13.0317083Z D=7168, 2025-05-07T20:33:13.0317161Z scale_ub=1200.0, 2025-05-07T20:33:13.0317242Z contiguous=False, 2025-05-07T20:33:13.0317325Z compiled=True, 2025-05-07T20:33:13.0317393Z ) 2025-05-07T20:33:13.0317604Z self = 2025-05-07T20:33:13.0317774Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:13.0317825Z 2025-05-07T20:33:13.0317899Z @given( 2025-05-07T20:33:13.0318015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0318111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0318222Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0318338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0318446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0318519Z ) 2025-05-07T20:33:13.0318761Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0318850Z def test_silu_mul_quant( 2025-05-07T20:33:13.0318923Z self, 2025-05-07T20:33:13.0318997Z T: int, 2025-05-07T20:33:13.0319069Z D: int, 2025-05-07T20:33:13.0319164Z scale_ub: Optional[float], 2025-05-07T20:33:13.0319253Z contiguous: bool, 2025-05-07T20:33:13.0319334Z compiled: bool, 2025-05-07T20:33:13.0319448Z ) -> None: 2025-05-07T20:33:13.0319549Z torch.manual_seed(2025) 2025-05-07T20:33:13.0319619Z 2025-05-07T20:33:13.0319786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0319853Z 2025-05-07T20:33:13.0319943Z x_sign = torch.sign(x) 2025-05-07T20:33:13.0320074Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:13.0320160Z x = x_sign * x_clamp 2025-05-07T20:33:13.0320295Z x0 = x[:, :D] 2025-05-07T20:33:13.0320384Z x1 = x[:, D:] 2025-05-07T20:33:13.0320466Z 2025-05-07T20:33:13.0320558Z if contiguous: 2025-05-07T20:33:13.0320647Z x0 = x0.contiguous() 2025-05-07T20:33:13.0320733Z x1 = x1.contiguous() 2025-05-07T20:33:13.0320799Z 2025-05-07T20:33:13.0320888Z if scale_ub is not None: 2025-05-07T20:33:13.0320990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:13.0321121Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:13.0321234Z ) 2025-05-07T20:33:13.0321309Z else: 2025-05-07T20:33:13.0321404Z scale_ub_tensor = None 2025-05-07T20:33:13.0321474Z 2025-05-07T20:33:13.0321598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:13.0321687Z op = silu_mul_quant 2025-05-07T20:33:13.0321768Z if compiled: 2025-05-07T20:33:13.0321863Z op = torch.compile(op) 2025-05-07T20:33:13.0321969Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0322039Z 2025-05-07T20:33:13.0322125Z > y_fp8, y_scale = fn() 2025-05-07T20:33:13.0322129Z 2025-05-07T20:33:13.0322228Z moe/activation_test.py:117: 2025-05-07T20:33:13.0322350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:13.0322451Z moe/activation_test.py:115: in fn 2025-05-07T20:33:13.0322546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:13.0322910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:13.0323003Z return fn(*args, **kwargs) 
2025-05-07T20:33:13.0323485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f0552a42020>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remaining Triton frames and the CompilationError are identical to the traceback above)
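The CompilationError above is Triton's NVIDIA backend rejecting the fp8e4nv (float8_e4m3fn) element type on this GPU. The 22 GiB device reported in the surrounding OOM messages is consistent with an NVIDIA A10G (compute capability 8.6), and Triton accepts fp8e4nv only on newer parts (roughly SM 8.9/Ada and up, an assumption inferred from this error), which is why it offers only 'fp8e4b15' and 'fp8e5' here. A minimal sketch of a hardware gate for such tests, assuming only public torch/pytest APIs; the helper and test names are illustrative, not from FBGEMM:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv kernels are rejected below SM 8.9 (assumption inferred from
        # the error above); this runner's GPU would report (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(not supports_fp8e4nv(), reason="fp8e4nv unsupported on this GPU")
    def test_silu_mul_quant_fp8_guarded() -> None:
        ...  # same body as test_silu_mul_quant above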
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The examples below re-run the same test body; only the failing statement and error are kept for each.

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 112.00 MiB; 28.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB; 140.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
moe/activation_test.py:95: OutOfMemoryError (tried to allocate 56.00 MiB; 28.44 MiB free)
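As a sanity check on the numbers (not part of the log), the allocation sizes match the bfloat16 input tensor the test builds at moe/activation_test.py:92: for T=16384 and D=7168 the [T, 2*D] tensor is exactly the 448.00 MiB reported, and T=16384 with D=5120 gives the 320.00 MiB figure:

    # Illustrative arithmetic only; T and D are copied from the failing examples.
    for T, D in [(16384, 7168), (16384, 5120)]:
        elements = T * (2 * D)             # shape [T, 2*D]
        bytes_needed = elements * 2        # 2 bytes per bfloat16 element
        print(T, D, bytes_needed / 2**20)  # -> 448.0 MiB and 320.0 MiB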
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 56.00 MiB; 28.44 MiB free)

The small-T examples still fit in the remaining memory and instead reach the Triton kernel, where each one fails with the same fp8e4nv CompilationError as above:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: CompilationError via triton/compiler/compiler.py:100 -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (same error)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:117: CompilationError (same error)
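Hypothesis keeps drawing new parameter combinations after each failure. Once a failing combination is known from the log, it can be pinned for deterministic replay; a minimal sketch, assuming only the public hypothesis API (strategy values copied from the @given block above, test body elided):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=7168)  # replay the failing input recorded above
    @settings(deadline=None, max_examples=10)
    def test_silu_mul_quant_replay(T: int, D: int) -> None:
        ...  # same body as test_silu_mul_quant above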
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 56.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117: CompilationError (same fp8e4nv error)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
moe/activation_test.py:94: OutOfMemoryError (tried to allocate 40.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 320.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 80.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 40.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 448.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
moe/activation_test.py:92: OutOfMemoryError (tried to allocate 112.00 MiB; 26.44 MiB free)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:13.0592642Z 2025-05-07T20:33:13.0592762Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:13.0592967Z 2025-05-07T20:33:13.0593063Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:13.0593460Z self=, 2025-05-07T20:33:13.0593846Z T=16384, 2025-05-07T20:33:13.0594036Z D=7168, 2025-05-07T20:33:13.0594216Z scale_ub=1200.0, 2025-05-07T20:33:13.0594436Z contiguous=True, 2025-05-07T20:33:13.0594645Z compiled=False, 2025-05-07T20:33:13.0594832Z ) 2025-05-07T20:33:13.0595136Z self = 2025-05-07T20:33:13.0595614Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:13.0595884Z 2025-05-07T20:33:13.0595956Z @given( 2025-05-07T20:33:13.0596174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:13.0596473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:13.0596768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:13.0597130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:13.0597443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:13.0597716Z ) 2025-05-07T20:33:13.0598052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:13.0598476Z def test_silu_mul_quant( 2025-05-07T20:33:13.0598713Z self, 2025-05-07T20:33:13.0598894Z T: int, 2025-05-07T20:33:13.0599081Z D: int, 2025-05-07T20:33:13.0599286Z scale_ub: Optional[float], 2025-05-07T20:33:13.0599540Z contiguous: bool, 2025-05-07T20:33:13.0599632Z compiled: bool, 2025-05-07T20:33:13.0599707Z ) -> None: 2025-05-07T20:33:13.0599797Z torch.manual_seed(2025) 2025-05-07T20:33:13.0599871Z 2025-05-07T20:33:13.0600033Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:13.0601877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
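As a sanity check on the numbers above: every failed request matches the [T, 2 * D] bfloat16 input tensor exactly, at 2 bytes per element. A quick verification in Python:

    # Size of the test's input x = torch.randn([T, 2 * D], dtype=torch.bfloat16)
    def input_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # 2 bytes per bfloat16 element

    assert input_mib(16384, 7168) == 448.0  # the 448.00 MiB requests
    assert input_mib(4096, 7168) == 112.0   # the 112.00 MiB requests
    assert input_mib(2048, 7168) == 56.0    # the 56.00 MiB request further down

The GPU simply has no headroom left: with 21.73 GiB of the 22.07 GiB already held by PyTorch, even the smallest of these requests cannot be served.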
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f03a14107c0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
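This failure is architectural, not memory-related: fp8e4nv is Triton's FP8 E4M3 type, which compiles only on NVIDIA compute capability 8.9 or newer (Ada, Hopper), while the A10G in a linux.g5.4xlarge runner is sm_86 and exposes only fp8e4b15 and fp8e5. A hedged sketch of the kind of skip guard such a test could use; the marker name is illustrative, not FBGEMM's actual guard:

    import pytest
    import torch

    def fp8e4nv_supported() -> bool:
        # Triton's fp8e4nv (E4M3) requires sm_89+ (Ada) or sm_90+ (Hopper).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical marker for tests that reach _fbgemm_silu_mul_quant's FP8 path.
    requires_fp8e4nv = pytest.mark.skipif(
        not fp8e4nv_supported(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )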
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92, tried to allocate 56.00 MiB (21.74 GiB allocated by PyTorch, 10.99 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

This example takes the torch.compile path (torch/_dynamo/eval_frame.py:678: in _fn, return fn(*args, **kwargs)) into the same call chain, activation.py:80 -> _fbgemm_silu_mul_quant[grid] -> triton jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir, and ends in the identical error:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
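By this point even 20.00 MiB requests fail, and the failure site has moved from the initial torch.randn (line 92) to the later x_clamp allocation (line 95), which suggests tensors from earlier Hypothesis examples are still holding nearly the whole 22 GiB. A minimal sketch of explicitly releasing cached CUDA memory between examples; whether FBGEMM's harness already does something equivalent is not shown by this log:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to old tensors first
        torch.cuda.synchronize()  # let in-flight kernels finish
        torch.cuda.empty_cache()  # hand cached, unreferenced blocks back to the driver

    # e.g. call release_cuda_memory() at the top of test_silu_mul_quant, or from a
    # pytest fixture, before allocating the [T, 2 * D] bfloat16 input.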
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp), tried to allocate 20.00 MiB (4.44 MiB free, 3.87 MiB reserved but unallocated)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn), tried to allocate 20.00 MiB (4.44 MiB free, 3.87 MiB reserved but unallocated)

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
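The three DeprecationWarning entries come from Triton's autotuner and are unrelated to the failure; until the autotune call sites stop passing warmup/rep/use_cuda_graph, they could be filtered. A sketch using the standard warnings module; the message regex copies the warning text above:

    import warnings

    warnings.filterwarnings(
        "ignore",
        message=r"warmup, rep, and use_cuda_graph parameters are deprecated",
        category=DeprecationWarning,
        module=r"triton\.runtime\.autotuner",
    )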
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 1 failed, 1 deselected, 3 warnings in 13.06s =================
2025-05-07T20:33:14.5652384Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:33:14.6271759Z [EXEC] [ATTEMPT 2/2] Command attempt failed.
2025-05-07T20:33:14.6272966Z [EXEC] The command has failed after 2 + 1 attempts; aborting.
2025-05-07T20:33:14.6273570Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py
2025-05-07T20:33:14.6289872Z ##[error]Process completed with exit code 1.
2025-05-07T20:33:14.6377085Z Post job cleanup.
2025-05-07T20:33:14.7366852Z [command]/usr/bin/git version
2025-05-07T20:33:14.7411652Z git version 2.47.1
2025-05-07T20:33:14.7449910Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/7cd6b761-893f-479d-9f5d-dbf6b58edd7b/.gitconfig'
2025-05-07T20:33:14.7461195Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/7cd6b761-893f-479d-9f5d-dbf6b58edd7b' before making global git config changes
2025-05-07T20:33:14.7462033Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:33:14.7466959Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:33:14.7511039Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:33:14.7545674Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:33:14.7879797Z Entering 'external/asmjit'
2025-05-07T20:33:14.7946769Z Entering 'external/composable_kernel'
2025-05-07T20:33:14.8021281Z Entering 'external/cpuinfo'
2025-05-07T20:33:14.8087678Z Entering 'external/cutlass'
2025-05-07T20:33:14.8161678Z Entering 'external/googletest'
2025-05-07T20:33:14.8228853Z Entering 'external/hipify_torch'
2025-05-07T20:33:14.8294875Z Entering 'external/json'
2025-05-07T20:33:14.8383489Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:33:14.8408818Z http.https://github.com/.extraheader
2025-05-07T20:33:14.8420610Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-05-07T20:33:14.8452360Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:33:14.8779008Z Entering 'external/asmjit'
2025-05-07T20:33:14.8820675Z http.https://github.com/.extraheader
2025-05-07T20:33:14.8865407Z Entering 'external/composable_kernel'
2025-05-07T20:33:14.8908953Z http.https://github.com/.extraheader
2025-05-07T20:33:14.8959067Z Entering 'external/cpuinfo'
2025-05-07T20:33:14.9001980Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9044795Z Entering 'external/cutlass'
2025-05-07T20:33:14.9088725Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9140169Z Entering 'external/googletest'
2025-05-07T20:33:14.9182255Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9224508Z Entering 'external/hipify_torch'
2025-05-07T20:33:14.9266306Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9308836Z Entering 'external/json'
2025-05-07T20:33:14.9350141Z http.https://github.com/.extraheader
2025-05-07T20:33:14.9503778Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:33:14.9534602Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:33:14.9544948Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:33:14.9545317Z ##[endgroup]
2025-05-07T20:33:14.9642161Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:25.7194798Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:42.1539827Z Cleaning up orphan processes
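For rerunning the suite with the allocator hint the OOM messages keep suggesting, PYTORCH_CUDA_ALLOC_CONF has to be in the environment before the first CUDA allocation. A minimal sketch for a local repro; exporting the variable in the shell before the `conda run python -m pytest ...` command above works equally well:

    import os

    # Set before torch initializes its CUDA caching allocator, i.e. before the
    # first device allocation such as torch.randn(..., device="cuda").
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  - imported after the allocator config on purpose

Note that expandable segments only mitigate fragmentation; they cannot help once earlier examples have genuinely consumed the card, so the cleanup sketch above remains the more likely fix.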